PREFACE


This  report  represents  the  findings  of  a  fifteen  month  research  study  conducted 
at  the  University  of  Washington  under  the  auspices  of  the  Department  of  Health 
Services  (School  of  Public  Health  and  Community  Medicine)  and  the  Graduate 
School  of  Business.    While  this  study  dealt    primarily  with  the  problem  of 
developing  a  viable  procedure  for  classifying  the  nation's  short-term  general 
hospitals  for  use  in  a  hospital  prospective  reimbursement  system,  a  significant 
number  of  related  issues  and  topics  had  to  be  investigated  and  studied.  Among 
these were such issues as the economics of hospitals and prospective reimbursement,
multivariate statistical methodologies for determining and assessing
hospital clusters, and computer software techniques and programs. Extensive
efforts were made throughout this study to ensure that all questions investigated
were studied in the most thorough manner possible, that our analysis was based on
state-of-the-art methods, and that our results and findings rest on the
strongest possible foundation.

With  any  research  study  of  this  type,  however,  advances  are  made  rapidly  and 
new  hypotheses  are  created  and  old  hypotheses  are  discarded  as  empirical  results 
dictate.     In  addition,  it  must  be  recognized  that  this  study  was  constrained 
to  an  investigation  of  the  immediate  questions  posed  by  the  funding  agency,  and 
limited  by  the  nature  of  the  data  base.     In  the  continuation  study  proceeding 
presently,  many  of  these  constraints  have  been  relaxed.     Therefore,  this  study 
should  be  viewed  as  a  forerunner  to  the  ongoing  studies  and  this  report  should 
be  considered  as  the  first  of  several  volumes. 

Project  Director  for  this  study  was  Professor  W.  L.  Dowling.  Co-investigators 
were  Professor  T.  D.  Klastorin  (who  directed  the  development  of  the  statistical 
methodology  and  empirical  investigations)  and  Professor  V.  Trivedi.  The 
Project  Associate  was  Professor  C.  A.  Watts,  who  directed  the  development  of 
the  general  economic  framework  described  in  Chapter  Two.     Mr.  R.  Lanier,  the 
Project  Research  Assistant,  conducted  much  of  the  empirical  analysis  with  the 
assistance  of  Mr.  R.  Flewelling,  who  wrote  most  of  the  computer  programs  for 
the  classification  analysis. 

The  SPSS  (Statistical  Package  for  the  Social  Sciences)  program  was  used  for 
most  of  the  standard  statistical  analyses,  including  the  factor,  regression, 
and  discriminant  analyses.  This  package  was  initially  developed  at  Stanford 
University  and  later  developed  by  the  National  Opinion  Research  Center  at  the 
University  of  Chicago.  The  current  version  used  in  this  study  was  developed 
by  the  Vogelback  Computing  Center,  Northwestern  University,  and  supported  by 
the  University  of  Washington  Computer  Center. 

The  authors  gratefully  acknowledge  the  helpful  comments,   suggestions,  and 
criticisms  received  from  Mr.  J.  Pettengill  during  the  course  of  this  study, 
and  the  secretarial  assistance  of  Ms.  C.   Sakai  and  Ms.  J.  Davis.     This  study 
was  supported  by  Contract  #600-76-0143  from  the  Office  of  Policy,  Planning, 
and  Research,  Health  Care  Financing  Administration,  U.   S.  Department  of  Health, 
Education,  and  Welfare. 
CHAPTER ONE

INTRODUCTION  AND  OVERVIEW 
I.  Introduction 

The  rise  in  insurance  coverage  in  recent  years  has  been  accompanied  by  a 
tremendous,  and  perhaps  not  unrelated,  rise  in  hospital  expenditures.     In  the 
resulting  search  for  ways  to  reduce  costs,  third  parties  (including  the  federal 
government) have become interested in the concept of prospectively-determined^1
payment rates as a replacement for reimbursement on the basis of reported costs.
The major purpose of this study is to examine a method of hospital classification
which might be used as part of a prospective reimbursement system.

The  impetus  for  a  prospective  system  comes  from  the  belief  that  some  fraction 
of the observed cost increase is "unjustifiable."  That is, rather than moving
from  point  A  in  Figure  1.1  to  point  B  as  demand  responds  to  rising  income  and 
increased  insurance  coverage,  proponents  of  incentive  reimbursement  argue  that 
the  industry  has  instead  moved  from  point  A  to  point  C.     The  upward  shift  in 
the  long  run  average  cost  curve  (LRAC)  is  attributable,  according  to  this 
argument,  to  two  main  sources:     decreases  in  the  operating  (technical)  efficiency 
of  hospitals,  and  changes  in  the  nature  or  quality  of  the  output  (product  choice 
efficiency).^2,3

It  is  argued  that  the  first,  operating  inefficiency,  has  been  encouraged  by 
the  payment  mechanism  by  which  providers  have  typically  been  reimbursed  for 
services  rendered.     Providers  determine  service  cost  retrospectively  and  are 
reimbursed for the full amount. As third party coverage increases, the argument
goes, this open-ended contract reduces the incentive for hospitals to
operate at minimum cost, and the long run average cost curve moves upward.


1

It is interesting to note that the original Medicare law called for
reasonable full-cost reimbursement with an additional plus factor for growth
and development. Almost immediately, the plus factor was dropped; somewhat
later the wording was amended to read reimbursement of "reasonable" cost.
Some forms of prospective reimbursement (the name is a slight contradiction
in terms) can be seen as an attempt merely to agree upon what constitutes
"reasonable cost" before the fact.

2 

Increases  in  input  prices  would  also  cause  the  curve  to  shift  upward. 

3 

Quantity  in  Figure  1.1  is  defined  in  terms  of  a  standard  unit  (e.g., 
hernia  equivalent)  to  avoid  measurement  problems. 
FIGURE 1.1

Health Care Cost Curves

[Figure not reproduced: cost curves plotted with COST on the vertical axis and QUANTITY on the horizontal axis, showing the points A, B, and C discussed in the text.]
There is some disagreement as to the causes of the second source, changes in
the nature or quality of output. It is possible that such changes are the
natural responses to increased standards of living, advances in technology (i.e.,
changes in the range of products that can be demanded), and increased insurance
coverage.^4 However, some authors^5 argue that many of the changes are strongly
influenced  by  providers — especially  physicians — if  not  actually  physician 
(or  hospital)  generated,  and  thus  are  not  reflections  of  consumers'  "true" 
preferences.     Given  the  level  of  consumer  knowledge  in  this  industry  and  the 
resulting  dependence  of  the  consumer  on  the  provider's  advice,  the  potential 
for  physician  and/or  hospital  influence  is  increased  as  insurance  lowers  the 
marginal  cost  to  the  patient  of  additional  services,  since  at  the  same  time  it 
reduces  the  benefit  (in  terms  of  dollar  savings  of  "unnecessary"  procedures  not 
performed)  of  acquiring  additional  information. 

A  system  of  prospectively  determined  rates  is  thought  to  address  these  two 
sources  of  cost  increases.     If  the  prospective  rate  puts  the  provider  at  risk 
for  production  costs  exceeding  the  agreed-upon  rate,  the  incentive  for  the 
institution that wishes to break even (or earn a surplus) is to hold expenditures
to this level (or below if surpluses need not be returned to the payor).
To  the  extent  operating  inefficiencies  exist,  their  elimination  is  one  means 
of  reducing  expenditures. 
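
As a minimal numerical sketch of this incentive, the following compares a hospital's net margin under retrospective full-cost reimbursement with its margin under a fixed prospective rate. The rate, unit costs, and volume are hypothetical and do not come from the study; the function names are illustrative only.

```python
def retrospective_margin(cost_per_unit: float, units: int) -> float:
    """Net margin when the payor reimburses reported cost in full."""
    payment = cost_per_unit * units  # payment tracks whatever cost is reported
    return payment - cost_per_unit * units  # always zero, whatever the cost


def prospective_margin(rate: float, cost_per_unit: float, units: int) -> float:
    """Net margin when the payor pays a rate agreed upon in advance."""
    return (rate - cost_per_unit) * units


# Hypothetical figures: an agreed rate of 100 per unit and 1,000 units.
for cost in (90.0, 100.0, 110.0):
    print(cost,
          retrospective_margin(cost, 1000),
          prospective_margin(100.0, cost, 1000))
```

Under the retrospective rule the margin does not depend on cost, so there is no financial reward for lowering it. Under the prospective rule every unit of cost above the rate is borne by the hospital.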

The  effect  of  incentive  reimbursement  on  changes  in  the  nature  of  output  depends 
upon  the  degree  to  which  output  is  provider  (hereafter  to  include  physicians 
as  well  as  hospitals)  influenced.     The  reimbursement  mechanism  can  affect  the 
output  vector  by  changing  the  incentives  facing  providers.     This  point  will  be 
discussed  more  fully  in  later  chapters. 

The  success  of  such  reimbursement  schemes  in  achieving  their  cost  containment 
goals  without  unduly  distorting  the  industry,  however,  depends  crucially  on  the 
answers given to several important questions concerning the design of the reimbursement
program. How the program parameter issues are resolved will shape
the  incentives  facing  providers,  and  in  turn,  providers'  responses  to  the  program. 

The  first  decision  requires  determining  what  payment  unit  should  be  used.  That 
is,  should  the  rate  be  based  on  units  of  output:     actual  services  rendered 
(e.g.,  one  rate  for  lab  test  A,  another  for  x-ray  B,  another  for  drug  C,  etc.), 
days  of  care  (e.g.,  all-inclusive  per  diem  rate),  or  admissions  (all-inclusive 
or  diagnostic  specific  rates);  or  should  it  be  based  on  units  of  time:  yearly 
(monthly)  departmental  or  institutional  budgets?    The  choice  of  this  parameter 
has a direct impact on the issue of provider influence on demand since it
determines  to  a  large  extent  the  financial  desirability  to  the  hospital  of 
changing  the  output  vector. 


4

Undesirable changes in consumer demand brought about by insurance through
the presence of moral hazard are best addressed by changing the nature of the
insurance contract. Altering the provider payment mechanism will have little
impact on the consumer-generated demand curve.

5

See, for example, Feldstein (1974).
The second decision requires determining which costs should be included in
the determination of the prospective rates. That is, should rates cover all
operating expenses, or should they apply only to selected pieces of the hospital's
operation? For example, in its payment of Medicare claims, the Social
Security Administration currently applies a pre-determined ceiling only to
routine costs, leaving all other costs (ancillary services, drugs, etc.) payable
retrospectively. The choice of this parameter is important if hospitals
are  to  be  prevented  from  escaping  effective  control  by  reallocating  costs  or 
from  distorting  the  output  vector  by  reallocating  effort  to  non-covered 
services  or  components. 

The  final  question  concerns  determination  of  the  set  of  hospitals  upon  which 
a  given  rate  will  be  imposed.     That  is,  should  each  provider  face  a  different 
rate;  should  all  providers  face  the  same  rate;  or  should  different  rates  be 
set  for  distinct  subgroups  of  the  provider  population?     Since  prospective  rates 
function  as  simulated  prices,  it  is  important  that  a  proper  response  be  given 
to  this  question  to  avoid  giving  improper  signals  to  providers. 

Failure  to  give  careful  consideration  to  each  of  these  questions  (summarized 
in  Table  1.1)  casts  doubts  on  the  desirability  and  efficacy  of  a  system  of 
incentive  reimbursement.     This  study  focuses  on  an  examination  of  the  third 
question:     the  criteria  by  which  hospitals  should  be  placed  in  subgroups  for 
purposes of prospective rate determination.^6 However, the three questions are
interrelated  to  the  extent  that  a  discussion  of  the  third  cannot  be  adequately 
carried  out  without  references  to  the  first  and  second.     Therefore,  while  the 
first  two  questions  are  not  directly  within  the  scope  of  this  study,  some 
attention  will  be  given  to  these  points  as  they  relate  to  the  grouping  issue. 

TABLE 1.1:

Elements of a Prospective Reimbursement System

1.  Payment Unit
        Per Service
        Per Day
        Per Admission
        Per Unit Time

2.  Included Costs
        Total Costs
        Routine Costs

3.  Coverage of Rates
        Each Hospital
        All Hospitals
        By Subgroups


6

In addition, the classification of hospitals is a crucial aspect of many
performance evaluation systems, as suggested by several proposals for establishing
a formal system of National Health Insurance (e.g., the McIntyre Bill,
S.1100, or the Kennedy-Mills Bill, S.3286 and H.R. 13870).
A.  Previous Studies

Although  hospital  classification  has  been  frequently  discussed  in  relation  to 
various  incentive  reimbursement  schemes,  there  has  been  only  a  limited  amount 
of  empirical  work  done  to  date.     One  characteristic  study,  performed  by  Berry 
(1973),  used  data  from  the  American  Hospital  Association  (AHA)  and  a  sample 
of  4,814  hospitals.     Berry  analyzed  the  frequency  distribution  of  services  and 
facilities  within  hospitals  and  found  five  distinct  groups  of  hospitals  based 
on the range of services provided, which extended from the most basic services
provided in small rural hospitals to the most complex services provided in
large metropolitan medical centers (Berry's labels are: Basic, Quality Enhancing,
Complex, Community, and Special). Significant differences were found to
exist  in  length  of  stay,  per  diem  cost,  and  occupancy  rates  among  these  groups 
of  hospitals. 

More recent studies have incorporated a larger spectrum of variables for hospital
classification. Phillip and Iyer (1975) classified 5,700 hospitals from
the  AHA  Annual  Survey  of  Hospitals,  using  a  total  of  seventeen  "product 
characteristic"  variables  (for  example,  number  of  beds,  number  of  RN's  and  LPN's, 
etc.).     Seventy-one  hospital  groups  were  generated  from  the  sample  using  cluster 
analysis  techniques. 

Another  study,  performed  by  Trivedi  (1977),  used  four  categories  of  variables 
including  hospital  demand  and  supply  measures,  measures  of  composition  (mix) 
of  hospital  output  and  measures  of  quantity  of  output.     The  study  was  performed 
for  94  short-term  general  hospitals  in  Washington  State  and  the  five  groups 
generated  are  currently  used  by  that  state  in  its  hospital  rate  review  process. 

The  primary  limitation  of  these  studies  is  that  little  attention  is  given  to 
the  selection  of  the  classification  variables,  which  is  of  utmost  importance 
to the success of any payment system based on the resultant groups. Furthermore,
the determination of an appropriate statistical classification methodology
is nowhere carefully considered. It is the aim of this study to begin to
overcome these shortcomings.

II.     Classification  Methodologies 

The  goal  of  any  classification  analysis  is  to  analytically  classify  hospitals 
in  such  a  way  as  to  simultaneously  maximize  hospital  homogeneity  within  groups 
and  hospital  heterogeneity  between  groups  (i.e.,  to  group  hospitals  such  that 
hospitals  in  the  same  group  are  more  alike  than  hospitals  in  different  groups). 
However,  if  homogeneous  clusters  of  hospitals  are  not  to  be  defined  by  an 
arbitrary  process,  the  method  used  must  resolve  a  number  of  crucial  questions 
in a statistically satisfactory manner. Among these, the most important questions
which must be considered by any classification methodology are listed
below: 

1.     How  can  homogeneity  among  hospitals  be  precisely  and  mathematically 
defined? 


2.     How  many  groups  should  there  be? 
3.  How should weights on each variable be determined?

4.  How  should  the  tradeoff  between  the  number  of  groups  and  overall 
homogeneity  be  made;  i.e.,  if  more  or  fewer  groups  are  desired,  how 
should  groups  be  combined  or  split  in  order  to  meet  the  desired 
criterion? 

5.  How  can  resultant  clusters  be  validated? 

In  addition  to  providing  answers  to  these  questions,  any  viable  grouping 
methodology  must  meet  the  usual  statistical  criteria  of  efficiency,  sufficiency, 
and  consistency. 
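
One common way to make the first question concrete is to measure homogeneity by the within-group sum of squared deviations from each group's centroid, and heterogeneity by the between-group sum of squares. The sketch below assumes that definition for illustration only; it is not necessarily the criterion adopted in Chapter Three, and the four "hospitals" and their two variables are hypothetical.

```python
import numpy as np


def sum_of_squares_decomposition(x: np.ndarray, labels: np.ndarray):
    """Return (within, between, total) sums of squares for a grouping."""
    grand_mean = x.mean(axis=0)
    total = float(((x - grand_mean) ** 2).sum())
    within = 0.0
    between = 0.0
    for g in np.unique(labels):
        members = x[labels == g]
        centroid = members.mean(axis=0)
        # spread of the group's members around their own centroid
        within += float(((members - centroid) ** 2).sum())
        # distance of the group centroid from the overall mean, weighted by size
        between += len(members) * float(((centroid - grand_mean) ** 2).sum())
    return within, between, total


# Hypothetical data: four hospitals described by two variables.
x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([0, 0, 1, 1])  # groups hospitals that lie close together
poor = np.array([0, 1, 0, 1])  # mixes distant hospitals in each group

print(sum_of_squares_decomposition(x, good))
print(sum_of_squares_decomposition(x, poor))
```

Because the within and between components always add up to the same total, minimizing one is equivalent to maximizing the other for a fixed number of groups. This is why the two goals can be pursued simultaneously.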

While a number of methodologies have been proposed for classifying multivariate
data (i.e., hospitals described by a vector of characteristics), each
has one or more limitations with respect to the questions and/or criteria
proposed above. Peterson (1971) points out that many of these methodologies
exist on a continuum: at one end are the highly pragmatic techniques of
multiple cross-classification, and at the other end are clustering
techniques based on a global optimization criterion. As illustrated in Figure 1.2,
the  techniques  of  regression  analysis,  factor  analysis,  and  heuristic  cluster 
analysis  lie  between  the  two  extremes. 

FIGURE 1.2:

Classification Methodologies

Multiple                 Factor      Regression    Cluster        Cluster
Cross-Classification     Analysis    Analysis      Analysis       Analysis
                                                   (Heuristic)    (Optimization)

The  most  pragmatic  technique  of  multiple  cross-classification  is  illustrated 
by  the  present  Health  Care  Financing  Administration  (Medicare  Program)  system 
of classifying hospitals on the basis of three dimensions:^7 urban-rural
designation, state per capita income, and hospital size in bed capacity. The
limitations of such a scheme are readily apparent. First, since it is necessary
to define variables by interval (as opposed to the precise value), useful
information is lost. Further, the designation of the interval range to be used
must be arbitrary and cannot be tested against an alternative range. Finally,
adding measurements will drastically increase the number of associated
classification cells and the number of clusters, thus reducing the possible
number of hospitals in each cluster. With the exception of the sufficiency
criterion, it is apparent that such a scheme fails to satisfactorily resolve the
questions stated above.
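
Two of these limitations can be illustrated with a short sketch. The interval counts, bed-size cutoffs, and sample size below are hypothetical and are not those of the Medicare scheme; the function names are illustrative only.

```python
from bisect import bisect_right
from math import prod


def cell_count(intervals_per_variable):
    """Number of cells produced by crossing the interval-coded variables."""
    return prod(intervals_per_variable)


def size_class(beds, cutoffs=(100, 200, 300)):
    """Index of the bed-size interval; the cutoffs are hypothetical."""
    return bisect_right(cutoffs, beds)


three_dims = [2, 5, 7]           # hypothetical interval counts per dimension
four_dims = three_dims + [5]     # one additional measurement with 5 intervals
n_hospitals = 1000               # hypothetical sample size

# Cells multiply with each added variable, so hospitals per cell shrink.
print(cell_count(three_dims), n_hospitals / cell_count(three_dims))
print(cell_count(four_dims), n_hospitals / cell_count(four_dims))

# Interval coding loses information: 99 and 101 beds are separated,
# while 101 and 199 beds are treated as identical.
print(size_class(99), size_class(101), size_class(199))
```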


7

For a more complete discussion of the limitations inherent in the Social
Security Administration's present classification system, see Phillip and Iyer,
1974; and Pointer and Phillip, 1974.
Regression analysis has infrequently been proposed as a clustering tool. Having
selected relevant independent variables, a regression equation can be determined
for a single dependent variable (presumably, in this case, some function of
hospital  cost).     Subsequently,  the  dependent  variable  values  predicted  by  the 
resultant  equation  can  be  used  to  determine  clusters;   since  the  classification 
problem  has  been  reduced  to  a  unidimensional  problem,  properties  and  procedures 
determined  by  Fisher  (1958)  for  finding  optimal  clusters  could  be  employed 
for  hospital  classification.     The  major  limitations  of  such  an  approach  are 
threefold:     (1)  the  determination  of  ultimate  hospital  groups  is  based  upon 
the  assumptions  of  linearity  and  additivity  (or  whatever  model  was  used  in  the 
regression  analysis)  and  might  not  reflect  the  true  underlying  relationship, 
(2)  the  variables  are  assumed  to  be  normally  distributed,  and   (3)  all  hospital 
groups  are  assumed  to  have  the  same  cost  structure.     In  addition,  the  dependent 
variable  (i.e.,  cost)  is  exceedingly  costly  to  measure  and,  even  if  measured, 
most  likely  would  not  reflect  efficient  cost  levels.     It  would  seem  more  prudent 
to determine hospital clusters from the independent variables alone, and subsequently
retain the capability to base a reimbursement scheme on whatever
statistic or formula seems most appropriate.
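
The unidimensional grouping step described above can be sketched as follows. Once each hospital has a single predicted value, an exact dynamic program can split the sorted values into a chosen number of contiguous groups so as to minimize the within-group sum of squares. This is a sketch in the spirit of the one-dimensional grouping problem treated by Fisher (1958), not a reproduction of his procedure; the predicted values and the function name are hypothetical.

```python
import numpy as np


def optimal_1d_groups(values, k):
    """Split sorted values into k contiguous groups minimizing the
    within-group sum of squared deviations (exact dynamic program)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    # Prefix sums let us evaluate the cost of any segment in constant time.
    s1 = np.concatenate(([0.0], np.cumsum(v)))
    s2 = np.concatenate(([0.0], np.cumsum(v ** 2)))

    def cost(i, j):
        """Sum of squares of v[i:j] about its own mean."""
        m = j - i
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m

    # best[g, j]: minimum cost of splitting the first j values into g groups.
    best = np.full((k + 1, n + 1), np.inf)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = best[g - 1, i] + cost(i, j)
                if c < best[g, j]:
                    best[g, j] = c
                    cut[g, j] = i

    # Walk the stored cut points backwards to recover the groups.
    groups, j = [], n
    for g in range(k, 0, -1):
        i = cut[g, j]
        groups.append(v[i:j].tolist())
        j = i
    return groups[::-1], float(best[k, n])


# Hypothetical predicted cost values from a fitted regression equation.
predicted = [31, 10, 50, 12, 30, 11]
groups, within_ss = optimal_1d_groups(predicted, 3)
print(groups, within_ss)
```

Note that the groups depend entirely on the predicted values, so any misspecification of the regression model carries over directly into the classification.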

The use of factor analysis to classify hospitals suffers from the same weaknesses
inherent in regression analysis: the relationship among the variables
is constrained to be linear, all hospitals are assumed to face similar cost
structures, and the variables are assumed to be normally distributed. The use
of  this  technique,  however,  for  preprocessing  variables  prior  to  hospital 
classification  is  discussed  in  Chapter  Three. 

The  last  techniques  shown  in  Figure  1.2  are  those  of  heuristic  and  optimization 
cluster  analysis.     These  procedures  usually  combine  the  vectors  of  hospital 
characteristics  into  a  single  composite  measure  between  all  pairs  of  hospitals 
(called  a  similarity  measure)  which  represents  each  hospital  pair's  affinity, 
and  subsequently  use  this  similarity  measure  to  develop  some  set  of  reasonable 
clusters. 
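
As a minimal sketch of such a composite measure, the following computes the Euclidean distance between every pair of hospitals after standardizing each variable. Distance is a measure of dissimilarity (smaller values indicate greater affinity), and its use here is an assumption for illustration; the similarity measure actually used in this study is described in Chapter Three. The data are hypothetical.

```python
import numpy as np


def standardized_distance_matrix(x):
    """Euclidean distance between every pair of rows after z-scoring columns."""
    x = np.asarray(x, dtype=float)
    # Standardize so that no variable dominates because of its units.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Broadcasting forms all pairwise differences between rows.
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


# Hypothetical hospitals: columns are beds and number of services offered.
x = [[50, 10], [60, 12], [400, 40], [420, 42]]
d = standardized_distance_matrix(x)
print(np.round(d, 2))
```

The result is a symmetric matrix with zeros on the diagonal; a clustering procedure then operates on this matrix rather than on the original vectors of characteristics.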

In  addition  to  the  methodologies  indicated  in  Figure  1.2,  there  are  other 
multivariate  statistical  procedures  related  to  the  classification  problem.  For 
example,  discriminant  analysis  assumes  that  populations  have  been  identified 
beforehand  and,  as  most  commonly  used,  finds  a  linear  function  which  attempts 
to  explain  the  maximum  difference  between  populations.     While  discriminant 
analysis  might  theoretically  be  applied  to  all  possible  partitions  of  hospitals, 
the  enormous  number  of  possible  groups  obviously  precludes  such  effort  (this 
is  discussed  in  further  detail  in  Chapter  Three). 
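
The size of this search space can be made concrete. The number of ways to partition n objects into k non-empty groups is the Stirling number of the second kind, which satisfies a simple recurrence. The sketch below is illustrative; the choice of 94 hospitals and five groups echoes the Washington State study cited earlier and is used only to show the order of magnitude.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labeled objects into k non-empty groups."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of the k existing groups or forms a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


print(stirling2(4, 2))                 # small case that can be checked by hand
print(len(str(stirling2(94, 5))))      # number of digits for 94 objects, 5 groups
```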

Another  technique  for  finding  homogeneous  clusters  is  the  Automatic  Interaction 
Detector  (AID),  initially  proposed  by  Morgan  and  Sonquist  (1963).     Like  multiple 
regression  analysis,  AID  studies  the  relationship  between  a  dependent  variable 
and a set of independent variables; however, unlike multiple regression analysis,
it does not require a predetermined functional form. AID continually
splits  the  data  into  subgroups  on  the  basis  of  independent  variables,  but  the 
"quality"  of  the  subgroups  is  measured  by  the  sum  of  squared  deviations  in  the 
dependent  variable.     AID  offers  a  number  of  advantages,  among  them  being  a  well 
defined  algorithm  and  no  a  priori  assumption  of  a  functional  form.     In  addition, 
it is well suited to large data sets when the number of independent variables
is fairly small (since the number of potential branches on the AID "tree" is
determined by the number of independent variables and the number of discrete
intervals of each). Besides the restriction on the number of variables, the
major limitation for this study is AID's use of a dependent variable; as previously
noted, outcome measures are not theoretically tractable (e.g., efficient
costs would be exceedingly difficult to measure). Other limitations, suggested
by Doyle and Fenwick (1975), include AID's tendency to inaccurately classify
objects if the dependent variable is heavily skewed, instabilities in the
resultant tree due to differences in the predetermined intervals for each independent
variable, and ambiguities relating to the algorithm's stopping rules.
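
A single splitting step of the kind AID performs can be sketched as follows, assuming one interval-coded independent variable and a binary split. This is a simplified illustration, not the Morgan and Sonquist program: it searches for the split that minimizes the summed squared deviations of the dependent variable within the two resulting subgroups. The data and names are hypothetical.

```python
import numpy as np


def best_binary_split(x, y):
    """Return (threshold, within_ss) for the split x <= threshold that
    minimizes the squared deviations of y within the two subgroups."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_ss = None, np.inf
    # Skip the largest value: splitting there would leave one side empty.
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ss < best_ss:
            best_threshold, best_ss = float(t), float(ss)
    return best_threshold, best_ss


# Hypothetical data: bed-size class (independent) and cost per day (dependent).
bed_class = [1, 1, 2, 2, 3, 3]
cost = [100.0, 110.0, 105.0, 115.0, 200.0, 210.0]
threshold, within_ss = best_binary_split(bed_class, cost)
print(threshold, within_ss)
```

The sketch makes the dependence on the outcome measure explicit: the subgroups are judged solely by the dependent variable, which is the limitation noted above for this study.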

A comparison of six state-of-the-art classification methodologies is presented
in Table 1.2. It is evident from this comparison that the most appropriate
methodology, one which explicitly incorporates a measure of similarity between
hospitals (i.e., an explicit measure of a hospital pair's affinity),
does not require assumptions regarding the underlying distribution of the multivariate
data, and most easily accommodates large scale data sets, is cluster
analysis: the methodology of choice in this study.

III.     Report  Outline 

The  remainder  of  this  report  is  organized  as  follows:     Chapter  Two  develops 
a  general  economic  framework  for  selecting  classification  variables. 
Empirical  measures  from  the  data  base  are  selected  to  represent  the  variables 
in  the  classification  process  itself.     The  statistical  methodology  developed 
for  the  clustering  problem  is  explained  in  Chapter  Three.     This  chapter  includes 
detailed  discussions  of  similarity  measure  calculation,  group  determination, 
and validation and testing of the resultant clusters. The fourth chapter presents
empirical findings from the data base of 1,070 randomly selected short-term
general hospitals, including descriptions of the clusters found for both
the complete sample and a subsample of 194 hospitals. The summary and conclusions
are presented in Chapter Five, and listings of all computer programs are
included  in  Appendix  C. 
CHAPTER TWO


ECONOMIC  RATIONALE  AND  VARIABLE  SELECTION 

I.  Introduction 

Chapter One outlined the various classification methodologies that are available
and discussed their strengths and weaknesses from a statistical perspective.
However,  even  the  most  appropriate  methodology  will  yield  misleading  results 
unless  the  variables  chosen  for  use  in  the  classification  process  are  also 
correct.     That  is,  while  the  choice  of  methodology  determines  the  path  leading 
from  the  inputs  to  the  conclusion,  the  conclusion  itself  is  determined  by  the 
choice  of  inputs:     the  classification  variables. 

It is interesting to note that in spite of its importance, the issue of variable
selection, unlike the issue of methodology selection, has received
practically  no  attention  in  the  literature.     In  none  of  the  studies  mentioned 
above  was  there  a  formal  development  of  the  variable  selection  criteria  from 
a  conceptual  basis.     That  is  the  task  of  this  chapter.     A  conceptual  framework 
which  suggests  the  variables  that  should  be  selected  will  be  developed  and 
applied  to  the  problem  and  data  set  at  hand. 

II.  Conceptual  Framework 

A.     Grouping  and  Prospective  Reimbursement 

It was pointed out in the introduction that perhaps a large share of the cost
increases of the last decade has stemmed from the fact that cost reimbursement
relaxes the market constraint on the insured portion of the hospital's
business. The objective of prospective reimbursement is to fill the void by
approximating  the  constraints  of  a  properly  functioning  market,  thereby 
inducing  hospitals  to  operate  more  efficiently  both  in  the  sense  of  technical 
efficiency  and  in  the  composition  and  nature  of  the  output  they  produce. 

Price  is  the  constraint  of  the  market.     That  is,  firms  face  a  price  for  the 
goods  they  produce:     a  price  that  is  determined  by  aggregate  market  forces 
and  reflects  the  cost  of  production.     In  competitive  situations,  prices  are 
not  subject  to  the  influence  of  any  single  firm.     Thus,  the  setting  of  an 
appropriate  prospective  rate  should  be  analogous  to  the  determination  of  an 
optimal  market  price. 

The  classification  of  hospitals  into  subgroups  for  the  purpose  of  rate 
determination  is  merely  recognition  that,  as  in  other  industries,  optimal 
market  prices  may  differ  across  producers  as  market  conditions  differ.  Thus, 
a  single  rate  for  all  hospitals  is  not  appropriate:     prospective  rates  should 
differ across hospitals as market conditions differ. The objective of grouping
is to identify the market conditions that would lead to price variation
in a competitive market and to classify hospitals into groups based on their
similarity with respect to these criteria. The implication is then that cost
variation  observed  within  these  groups  arises  from  some  other  "unjustified" 
source and should not be recognized in the rate structure.^1

The  importance  of  these  grouping  criteria  to  the  success  of  the  reimbursement 
scheme cannot be over-emphasized. To avoid imposing unnecessary and inappropriate
hardship on providers, account must be taken of differences in exogenous
sources  of  cost  variation.     Failure  to  adjust  for  these  differences  will  result 
in  a  system  that  is  inequitable  and  arbitrary,  resulting  in  serious  distortions 
in  the  short  run,  and  a  movement  of  providers  out  of  the  industry  in  the  long 
run.     On  the  other  hand,  adjusting  rates  for  cost  differences  arising  from  other 
sources  allows  "leakage"  in  the  control  mechanism  (i.e.,  allows  firms  to  escape 
the    constraint)  and  may  provide  incentives  for  distorting  the  system  further. 
There  is  substantial  evidence  from  other  industries  that  firms  can  and  do 
adjust  to  controls  by  altering  uncontrolled  aspects  of  their  operations.  In 
this  way,  the  effect  of  the  constraint  on  the  firm's  ability  to  pursue  its 
objectives  is  minimized.     For  example,  the    airline   industry  has  responded  to 
the  imposition  of  price  controls  by  changing  an  unconstrained  characteristic 
of  the  output  vector  (service  frills)  that  it  offers  for  a  given  price.  Public 
utilities,  facing  a  ceiling  on  the  rate  of  return  to  capital  they  may  earn, 
have  increased  their  capital  stock  in  order  to  be  allowed  to  earn  more  absolute 
profit  dollars.     (See  Noll,  1973). 

The list goes on. To avoid creating similar incentives in the hospital industry,
the  selection  of  criterion  variables  used  to  group  hospitals  must  receive  as 
careful  attention  as  the  actual  mechanism  through  which  controls  are  imposed. 

B.     Cost  Influencing  Factors 

Since  prospective  rates  are  to  serve  as  producer  prices,  the  appropriate  place 
to  turn  for  insight  into  the  selection  of  proper  grouping  variables  is  the 
theory  of  the  market.     Economic  theory  suggests  that  in  a  world  of  perfectly 
competitive  profit-maximizing  firms,  the  prices  of  goods  and  services  observed 
in  any  market  reflect  the  opportunity  cost  of  the  resources  used  in  producing 
those  goods  and  services.     Further,   since  firms  are  motivated  by  profits,  their 
incentive  is  to  produce  exactly  that  combination  of  goods  and  services  demanded 
by consumers, given tastes, incomes and prevailing prices for factors of production
(which, given the assumptions of this section, represent marginal costs).
In  this  situation  all  firms  are  driven  through  competition    to  a  single  price 
for  a  given  output  (adding  the  assumption  of  adequate  information  flow),  and 
similar  goods  of  different  quality  will  exist  at  different  prices  only  to  the 
extent that consumer preferences dictate these differences (e.g., imported
French  wine  will  be  purchased  at  higher  prices  than  some  domestics,  but  the 
market  for  gold  lunchboxes  is  small). 

Thus, prices (and costs) vary across firms because of output differences.
However,   since  output  has  a  number  of  dimensions,  cost  differences  can  arise 


It  should  be  noted  that  variable  definition  and  measurement  problems 
reduce  the  accuracy  of  this  proposition  in  practice. 
when any aspect of output differs across firms. For example, even in an industry
of single product firms, there may exist differences in the nature of
the  product,  even  though  it  is  given  only  one  name.     That  is,  hospitals  may 
have  different  "efficient"  cost  curves  if  one  institution's  appendectomy 
patients  are  diabetic  and  obese  while  another  hospital  provides  appendectomy 
to  otherwise  healthy  patients  without  complications. 

Further,   there  may  be  quality  differences  leading  to  cost  variations.     If  the 
wine  that  is  aged  for  several  years  in  wooden  casks  is  of  higher  quality  than 
a  brew  marketed  from  steel  vats  after  one  year,  its  higher  costs  will  be 
covered  by  higher  prices  as  long  as  consumers  perceive  (and  are  willing  to  pay 
for)  the  difference  in  quality.     The  appropriate  analogy  in  the  hospital  indus- 
try might    also  focus  on  outcomes.     If  like  patients  with  a  given  condition  in 
hospital  A  have  a  lower  rate  of  mortality  and  morbidity  following  admission 
than  do  those  in  hospital  B,  patients  may  be  willing  to  pay  a  higher  rate  to  be 
treated  in  the  former.     Some  may  also  be  willing  to  pay  more  for  nicer  surroun- 
dings, added  comfort  or  more  privacy. 

In  a  multiple-product  firm,   the  situation  becomes  more  complicated.  Ideally, 
variable  costs  are  allocated  to  the  various  outputs  and  rates  are  set  according 
to  marginal  production  costs.     Where  this  is  the  case,  the  preceding  arguments 
hold.     The  marginal  cost  of  producing  a  Volkswagen  tune-up  of  a  given  quality 
should  be  the  same  across  repair  shops,  regardless  of  the  other  services  that 
may also be available in any shop (abstracting as before from exogenous factors).
The  same  is  true  of  the  inpatient  setting:     a  diagnostic  chest  x-ray  given  in 
a  teaching  institution  should  be  no  different  nor  more  costly  than  the  same 
procedure  performed  in  another  type  of  inpatient  facility.     The  parallel  in 
this  instance  is  less  exact,  however,   since  the  availability  of  some  specialized 
equipment  and  personnel  may  affect  the  probability  of  experiencing  a  given  out- 
come.    That  is,  a  delivery  may  take  no  more  resources  in  a  large  hospital  if  the 
delivery  is  normal,  but  the  outcome  of  a  complicated  birth  might  be  enhanced 
by  the  presence  of  backup  facilities  (e.g.,  post-natal  intensive  care  units, 
premature  nurseries,  etc.)  unavailable  in  a  smaller  institution.  Therefore, 
a  woman  who  believed  the  probability  of  complications  in  her  delivery  to  be 
large  might  well  be  willing  to  pay  the  higher  rate  of  the  larger  hospital,  even 
if  the  birth  turned  out  to  be  normal,  because  of  the  reduction  of  risk.  Over 
time,   if  these  differences  are  indeed  reflected  in  charges  (and  if  patients 
pay  their  own  charges),   this  suggests  that  women  who  expect  normal  deliveries 
will  gravitate  to  the  smaller  (less  costly)  hospital,   leaving  the  larger  insti- 
tution with  a  more  complex  mix  of  deliveries  and  an  appropriately  higher 
average  daily  cost. 
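
The sorting argument above can be illustrated with a small calculation. The sketch below is only an illustration: the per-delivery costs and the shares of complicated deliveries are hypothetical numbers chosen for the example, not figures from this study. It assumes a normal delivery costs the same in either hospital, so any difference in average cost comes from case mix alone.

```python
def average_cost(share_complicated, cost_normal, cost_complicated):
    """Average cost per delivery for a given share of complicated cases."""
    return (1 - share_complicated) * cost_normal + share_complicated * cost_complicated

# Hypothetical costs: a normal delivery costs the same in either hospital.
COST_NORMAL = 1000.0
COST_COMPLICATED = 3000.0

# Hypothetical case mixes after patients sort themselves by expected risk.
small_hospital = average_cost(0.05, COST_NORMAL, COST_COMPLICATED)
large_hospital = average_cost(0.30, COST_NORMAL, COST_COMPLICATED)

print(f"small hospital average cost: {small_hospital:.2f}")
print(f"large hospital average cost: {large_hospital:.2f}")
```

Under these assumed numbers the larger hospital shows the higher average cost even though it is no less efficient on any single type of delivery.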

Thus, price differences within a market in this setting are observed
only to the extent there exist differences in the dimensions of output. Further,
these output and quality differences reflect the exogenous preferences of
consumers.

Price differences for a given kind and quality of output may exist across markets
in  this  setting  only  if  there  are  input  market  imperfections  leading  to  long 
run  differences  in  factor  prices  across  markets,  or  local  regulations  that  have 
an  impact  on  cost  (if  lower  priced — gross  of  shipping  charges — goods  from  other 
markets  cannot  be  transported  in)   (See  Table  2.1). 
TABLE 2.1: Major Sources of Price Variation Among Competitive Firms

1. Nature of Output
2. Differences in Quality of Output
3. Output Mix for Multiproduct Firms
4. Input Price Differences
5. Local Regulations
6. Scale Economies Across Markets


An  important  point  to  note  is  that  in  this  setting  characterized  by  perfect 
competition  and  adequate  information  flow,  differences  in  market-generated 
prices  spring  only  from  forces  exogenous  to  the  individual  firm.     Price  differ- 
ences arising  from  differences  in  endogenous  factors  such  as  technology  employed, 
age  or  size  of  plant,  or  input  mix  will  not  be  sustained  in  the  long  run:  firms 
that  cling  to  their  endogenous  differences  will  be  forced  out  of  business  by  the 
lower  cost  competition. 

C.     Implications  for  the  Hospital  Industry 

What  does  this  theory  imply  for  the  determination  of  prospective  rates  in  the 
hospital  industry?    First,  one  must  question  whether  the  assumptions  that  lead 
to  the  above  conclusions  hold  in  the  market  for  hospital  services.     If  so,  then 
it  follows  that  the  level  of  cost  in  the  industry  is  appropriate,  the  observed 
output  mix  is  appropriate,  and  prospective  rates  reflecting  these  costs  are 
appropriate.     In  this  situation,  hospitals  should  be  grouped  on  the  basis  of 
observed  cost;  hospitals  with  dissimilar  costs  should  be  in  different  groups. 

Unfortunately,  the  assumptions  do  not  hold.     While  perfect  competition  may  be 
an  economist's  fantasy,  a  strong  argument  could  be  made  that  even  "adequate" 
competition  (adequate  in  the  sense  of  allowing  a  close  approximation  to  the 
previously  outlined  conclusions)  does  not  exist  in  most  hospital  markets. 
Barriers  to  entry  in  the  form  of  capital  availability,  certificate  of  need  laws, 
and  AMA  restrictions  on  the  supply  of  an  essential  complementary  good  (physi- 
cians) act  to  limit  the  forces  of  competition.     In  addition,  the  presence  of 
insurance  on  a  wide  scale  further  relaxes  the  ordinary  market  constraints.  As 
the price of hospital services at the point of use approaches zero, the price sensitivity of
consumers  is  greatly  reduced,  and  demand  increases.     This  situation  is  exacer- 
bated by  fee  for  service  payment  of  physicians  and  cost  reimbursement  of 
inpatient  facilities,  both  of  which  further  encourage  the  expansion  of  output. 

Further,   information  is  costly  in  this  industry.     It  was  noted  in  Chapter  One 
that  the  lack  of  knowledge  about  the  product  leaves  the  consumer  dependent  on 
the  provider,  facilitating  the  influence  or  generation  of  demand  by  physicians 
and/or  hospitals.     Further,  information  regarding  prices  and  insurance  coverage 
is  not  always  readily  available.     These  two  facts  increase  the  opportunity  for 
non-competitive  behavior  among  firms. 




The  implication  of  dropping  the  assumptions  of  perfect  competition  and  free 
information  flow  is  that  market-generated  prices  can  no  longer  be  assumed  to 
reflect  "true"  minimum  costs  since  there  may  be  a  margin  for  monopoly  rents. 
Without  perfect  competition,  price  differences  reflective  only  of  differing 
degrees  of  monopoly  power  may  be  sustained  over  time  within  one  market  (the 
example  of  monopolistic  product  differentiation  is  applicable  here)  or  across 
markets.     In  this  situation,  taking  market  prices  as  optimal  for  use  in  a  rate 
determination  scheme  is  inappropriate. 

Another  problem  is  that  most  firms  in  this  industry  are  legally  non-profit 
entities.     Thus,  since  hospitals  may  not  be  attempting  to  maximize  profits,  it 
is  no  longer  clear  as  it  is  in  other  industries  that  their  objectives  will  be 
furthered  by  producing  the  combination  of  services  (output)  most  valued  by 
consumers.     Given  the  insurance-induced  price  insensitivity  of  consumers, 
firms  have  no  valid  signals  as  to  consumer  preferences.     This  fact,  combined 
with  the  other  barriers  to  perfect  competition,  means  that  the  opportunity  of 
the  non-profit  firm  to  pursue  its  own  objectives  without  fear  of  financial 
losses  is  increased.     If  these  objectives  are  furthered,  for  example,  by 
expanding  the  production  of  one  particular  type  of  service  or  increasing  the 
quality  of  output  beyond  the  level  indicated  by  consumer  preferences,  an 
inefficient  output  mix  will  be  produced  by  the  industry.     This  situation  is 
exacerbated  to  the  extent  that  providers  can  influence  consumers'  purchase 
decisions.

Neither  is  there  the  assurance  of  technical  efficiency  in  the  sense  of  pro- 
ducing the  largest  amount  of  services  for  a  given  set  of  inputs,  an  assurance 
guaranteed  in  a  profit  maximization  setting,  since  the  value  to  the  non-profit 
firm  of  the  profits  lost  through  this  type  of  inefficiency  is  likely  to  be 
small  (unless  profits  may  be  accumulated  for  investment  in  projects  that 
yield  utility)    (Clarkson,  1972).     There  is  also  the  possibility,  as  suggested 
by  some  authors,  that  specific  inputs  rather  than  outputs  enter  the  hospital's 
objective  function  (Lee,  1971).     Thus,  the  expectation  of  technical  efficiency 
is  further  reduced.     Feldstein  (1971)  goes  a  step  further  to  suggest  the 
inclusion  of  selected  input  prices  in  the  vector  of  objectives,  again  implying 
distortions  in  the  input  mix. 

Thus,  observed  interfirm  price  and  output  differences  in  this  largely  non- 
profit market  characterized  by  imperfect  competition  and  provider  influence 
of  consumption  decisions  may  not  be  assumed  to  reflect  consumer  preferences 
expressed  in  light  of  relative  production  costs.     It  may  be  more  likely  that 
they  reflect  (at  least  to  some  extent)  differences  in  individual  firm  objectives 
or  differences  in  the  level  of  insurance  coverage.     Thus,  since  incentive 
reimbursement  systems  are  superimposed  on  an  industry  that  has  been  and  continues 
to  be  heavily  insured  and  primarily  non-profit  oriented,  this  result  implies 
that  rates  set  on  the  basis  of  historical  cost  and  the  firm's  status  quo  may  be 
inappropriate.     This  suggests  that  grouping  hospitals  according  to  observed 
costs and setting a separate rate schedule for each group (which probably
amounts to separate rates for each hospital) will only perpetuate existing
inefficiencies.
How then are appropriate groups to be determined? The previous examination
of  the  perfectly  competitive  situation  would  indicate  that  hospitals  should 
have  different  costs  (and  therefore  face  different  prices  or  rates)  only 
when  they  differ  with  respect  to  the  exogenous  characteristics  outlined  in 
that  discussion:     input  prices,  product  mix  (in  multiple  product  firms), 
quality  and  nature  of  output. 

This  point  is  very  important.     Allowing  a  hospital  to  move  to  a  group  with 
higher  rates  because  of  a  change  in  an  endogenous  factor  encourages  other 
hospitals  to  change  this  factor  purposefully  to  increase  their  own  rates. 
Therefore,  an  appropriate  grouping  scheme  should  be  based  on  variables  that 
cannot  be  manipulated  by  the  institutions'  administrators. 
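
The grouping principle stated above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the classification procedure developed in this study: the hospital records, the field names (`wage_area`, `rural`, `beds`, `cost_per_case`) and the use of the group median as the rate are all hypothetical. The point it shows is that only variables outside the administrator's control define a group, so changing an endogenous characteristic such as bed count cannot move a hospital to a different rate.

```python
from collections import defaultdict
from statistics import median

# Hypothetical hospital records; all field names and values are illustrative.
hospitals = [
    {"name": "A", "wage_area": "high", "rural": False, "beds": 400, "cost_per_case": 1500.0},
    {"name": "B", "wage_area": "high", "rural": False, "beds": 150, "cost_per_case": 1300.0},
    {"name": "C", "wage_area": "high", "rural": False, "beds": 250, "cost_per_case": 1400.0},
    {"name": "D", "wage_area": "low", "rural": True, "beds": 40, "cost_per_case": 1100.0},
    {"name": "E", "wage_area": "low", "rural": True, "beds": 60, "cost_per_case": 900.0},
]

# Only characteristics assumed to be outside the administrator's control.
EXOGENOUS_KEYS = ("wage_area", "rural")

def group_rates(records, keys=EXOGENOUS_KEYS):
    """Return {group key: median cost per case}, grouping on exogenous keys only."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["cost_per_case"])
    return {key: median(costs) for key, costs in groups.items()}

rates = group_rates(hospitals)
for key, rate in sorted(rates.items()):
    print(key, rate)
```

Because `beds` is never consulted, a hospital that adds beds stays in the same group and faces the same rate.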

The  task  of  the  next  section  is  to  address  more  specifically  the  grouping 
variables  that  might  be  used  in  the  hospital  industry  in  light  of  the  theore- 
tical considerations. 


III.     The  Variables 

While  the  theory  of  the  market  presented  in  Section  II  provides  some  insight 
into  the  exogenous  sources  of  cost  differences  among  efficient  firms,  the 
translation  of  this  theory  into  the  specific  variables  that  should  be  used  to 
group  hospitals  for  rate  setting  cannot  in  general  be  made  without  reference 
to  other  program  parameters  as  outlined  in  Chapter  One. 

That  is,   the  subset  of  appropriate  grouping  variables  (within  the  set  of 
variables  identified  by  the  foregoing  theoretical  considerations)  will  differ 
with  the  unit  of  payment  that  is  chosen  (e.g.,  total  budget,  per  case — 
diagnostic  specific  or  all-inclusive,  per  diem,  or  per  service),  and  the 
costs  that  are  to  be  included  for  control  (e.g.,  total  costs,  routine  costs, 
etc.).

The  scope  of  the  program  in  terms  of  its  specific  objectives  will  also  influence 
the  choice  of  grouping  variables.     This  point  relates  to  the  discussion  in 
Chapter  One  of  the  two  types  of  inefficiencies  that  are  thought  to  exist  in 
the  industry:     technical  inefficiency  and  product  choice  inefficiency.     If  the 
control  program  limits  its  focus  to  the  first  type,  the  grouping  criteria  are 
somewhat  different  than  in  a  more  ambitious  program  that  attempts  to  control 
both  sources  of  cost  increase. 

The  impact  of  program  objectives  will  be  discussed  first,  followed  by  an  exam- 
ination of  the  grouping  variables  appropriate  to  the  various  program  designs. 

A.     Program  Objectives  and  Variable  Selection 

In  Section  II  it  was  concluded  that  differences  in  the  nature  and  quality  of 
output  are  among  the  prime  exogenous  sources  of  cost  variation  among  firms. 
However,   it  was  also  suggested  that  these  variables  may  not  be  determined 
exogenously  in  the  industry  in  question.     That  is,  a  consumer  who  does  not 
have  adequate  information  concerning,  for  instance,  the  impact  of  various 
procedures on his health for a given condition must rely on the providers of
health  services  to  direct  his  medical  purchases.     This  fact  combined  with  the 
fact  that  the  consumer  likely  faces  below  cost  service  prices  because  of 
insurance  suggests  that  physicians  and/or  hospitals  can  direct  the  consumption 
of  services  both  in  number  and  in  kind. 

It  is  important  to  note  that  the  extent  to  which  "extra"  or  overly  specialized 
services  will  be  suggested  by  a  provider  depends  in  part  upon  the  firm's  ob- 
jective function.     If  the  firm  is  a  pure  profit  maximizer,  it  will  direct  an 
increase  in  consumption  of  its  most  profitable  services.    A  regulatory  mechan- 
ism that  is  tight  enough  to  eliminate  economic  profit  from  all  services  would 
effectively  eliminate  this  distortion  since  nothing  would  be  gained  by  such 
actions.     However,  if  the  firm  is  maximizing  some  non-profit  objectives  such 
as  quantity  of  services  or  the  production  of  a  subset  of  services  (e.g.,  highly 
specialized procedures), a break-even reimbursement rate will not curb the
provider's  incentive  to  encourage  consumption. 

Further,   if  the  physician  rather  than  the  hospital  is  the  prime  source  of 
advice  for  the  consumer  (the  Feldstein  "agent",  see  Feldstein,  1974),  setting 
a  break-even  rate  for  the  hospital  will  have  no  deterrent  effect  on  over- 
consumption  if  the  physician-agent  is  deriving  some  benefit,  either  financial 
or  otherwise,  from  it.     The  hospital  in  this  case  is  a  passive  actor. 

These  two  points  hold  for  every  dimension  of  output.     Including  a  quality 
measure  in  the  set  of  grouping  variables  so  that  hospitals  with  higher  quality 
are  allowed  higher  rates  encourages  the  production  of  "Cadillac"  care,  even  when 
the  value  of  the  additional  quality  to  the  consumer  is  low  relative  to  its  true 
cost  (e.g.,  all  expectant  mothers  would  be  encouraged  to  deliver  in  an  institu- 
tion with  a  full  range  of  back-up  services — at  higher  cost — even  if  there  was 
a very small probability of complications). The same argument holds with
respect  to  plant  size.     While  differences  in  scale  of  plant  may  be  appropriate 
across  market  areas  due  to  differences  in  the  level  of  demand,  adjusting  for 
scale  differences  for  all  hospitals  regardless  of  location  will  produce 
undesirable    incentives  for  the  output-maximizing  hospital. ^    This  point  cannot 
be  over-emphasized.     Building  such  incentives  into  the  reimbursement  mechanism 
may  result  in  a  system  that  fosters  higher  rather  than  lower  costs.  Further, 
in  the  long  run  technology  and  output  characteristics  will  be  determined  by  the 
objectives of individual firms, encouraged by system incentives, rather than
by  the  individual  or  collective  preferences  of  consumers. 

It  is  possible  that  even  the  composition  of  output  (hereafter  referred  to 
as  case  mix)  can  be  influenced  by  the  provider,  if  in  name  only.     That  is, rou- 
tine appendectomies  might  be  registered  as  appendectomies  with  complications 
if  producing  more  of  the  latter  moved  the  institution  to  a  group  with  a  higher 
reimbursement rate (or carried a higher reimbursement rate directly).


Most  empirical  studies  indicate  that  scale  differences  have  little  impact 
on  cost  once  account  has  been  taken  of  differences  in  the  nature  of  output.  If 
some  confidence  could  be  placed  in  these  studies,  this  would  suggest  that  no 
further  rate  adjustment  should  be  made  on  the  basis  of  plant  size  (e.g.,  number 
of beds). However, the methodology and data used in most scale investigations
have been sufficiently questionable that their results must be regarded with
some caution.
The important point to note is that the problem of regulating firms for
which  demand  constraints  are  in  general  not  binding  (because  of  the  presence 
of  insurance)  is  much  more  complex  than  regulating  monopolistic  firms  in  other 
industries.     The  problems  are  exacerbated  by  the  non-profit  motive,  but  this 
is  secondary  in  its  importance  to  the  insurance  issue.     In  this  situation,  the 
regulator  can  choose  between  two  basic  objectives.     The  first  is  to  accept  the 
insurance-induced  distortions,   treat  all  dimensions  of  demand  as  exogenous,  and 
attempt  only  to  prevent  large  interfirm  differences  in  technical  inefficiency. 
The  appropriate  adjustment  variables  for  diagnostic-specific  service  rates  in 
this  case  are  measures  of  product  differentiation:     the  nature  and  quality  of 
services;  and  the  set  of  exogenous  factors:     factor  prices,  unionization,  and 
local  regulations. 

The  second  alternative  is  to  attempt  to  use  incentive  reimbursement  as  a 
mechanism  not  only  for  controlling  technical  inefficiency,  but  also  to  attempt 
to  correct  the  product  choice  inefficiency  brought  about  by  insurance.     This  is 
a  much  more  complicated  task,   since  it  implies  knowledge  of  consumers'  true 
demand  curves.     If  such  information  were  available,  or  if  society  were  willing 
to  specify  acceptable  levels  of  demand  for  various  services  (defined  across 
all  product  dimensions)  by  socioeconomic  and/or  demographic  characteristics 
of  the  population,  appropriate  areawide  budgets  could  be  developed  based  on 
these  characteristics.     The  task  of  the  local  regulator  would  then  be  to 
accept  bids  from  area  providers  for  each  type  of  service  and  allocate  the  area's 
budget  to  the  different  providers  according  to  the  lowest  bid.  Obviously, 
this  type  of  approach  is  infeasible  at  the  present  time.     A  similar  method 
which  may  be  more  feasible  is  to  attack  the  problem  from  an  altogether  different 
angle.     A  restructuring  of  insurance  coverage  to  focus  on  desired  benefits 
without  removing  the  price  sensitivity  of  consumers  would  reduce  (if  not 
remove) the need for provider regulation.^ Regulatory focus in this situation
might revolve around provision of information and more traditional
monopoly  control. 
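
The bidding scheme described earlier in this paragraph, in which the local regulator accepts bids for each type of service and awards the area's budget to the lowest bidder, can be sketched as follows. The service names, volumes, providers and bid prices are hypothetical, and the sketch assumes the acceptable volume of each service has already been specified, which is precisely the information the text notes is not available at present.

```python
def allocate_by_lowest_bid(volumes, bids):
    """Award each service to the lowest bidder and cost out the area budget.

    volumes: {service: number of units to be funded in the area}
    bids:    {service: {provider: bid price per unit}}
    """
    awards = {}
    total = 0.0
    for service, units in volumes.items():
        # Pick the provider with the lowest per-unit bid for this service.
        provider, price = min(bids[service].items(), key=lambda item: item[1])
        awards[service] = (provider, price)
        total += units * price
    return awards, total

# Hypothetical areawide volumes and provider bids.
volumes = {"normal delivery": 200, "appendectomy": 50}
bids = {
    "normal delivery": {"Hospital A": 950.0, "Hospital B": 900.0},
    "appendectomy": {"Hospital A": 1800.0, "Hospital B": 2100.0},
}

awards, total = allocate_by_lowest_bid(volumes, bids)
print(awards)
print(total)
```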

However,   since  the  latter  alternatives  are  somewhat  beyond  current  possibili- 
ties,  it  is  desirable  to  seek  a  middle  ground  between  complete  passivity 
(i.e.,   treating  all  dimensions  of  demand  as  exogenous)  and  major  insurance 
reform. 

The  approach  which  will  be  assumed  in  the  remainder  of  this  discussion  is  one 
such  middle  ground.  Rather  than  treating  all  dimensions  of  output  as  exogen- 
ously  determined,  it  is  assumed  that  the  mix  of  output  (i.e.,  diagnostic  case 
mix)  and  the  nature  of  output   (in  terms  of  severity)  are,  if  not  purely  exogenous, 


One  possibility  is  to  follow  the  lead  of  some  European  countries  and 
issue  medical  care  vouchers  to  consumers.     Consumers  are  then  responsible  for 
choosing  the  source  of  care  they  receive  as  well  as  allocating  the  fixed  sum 
in  the  manner  that  best  reflects  their  preferences. 
at least more difficult for hospitals and physicians to influence. Quality
differences  are  ignored  with  the  assumption  that  providers  accepted  for  Medicare 
(or  more  general)  participation  in  the  reimbursement  program  meet  some  minimum 
standard  of  quality.     Since  standards  can  always  be  adjusted,  this  level  is 
taken  as  acceptable.     The  justification  for  this  approach  is  twofold.  First, 
the  definition  of  quality  is  somewhat  ambiguous.     High  quality  treatment  is  not 
necessarily  synonymous  with  highly  specialized,  technologically  advanced  care, 
although  that  correlation  is  too  frequently  drawn.     That  is,  a  small  hospital 
concentrating  mainly  on  simple  procedures  may  produce  higher  quality  care  as 
measured  in  terms  of  outcome  for  its  diagnostic  mix  than  a  hospital  with  very 
advanced  equipment  but  sloppy  administration  and/or  no  expertise.     Thus,  high 
quality  need  not  mean  high  cost  (Shortell,  1976).     The  second  point  is  that 
quality  is  highly  endogenous.     While  there  is  much  disagreement  in  the  litera- 
ture regarding  the  exact  elements  of  the  hospital's  objective  function,  quality 
of  care  is  often  cited  as  a  parameter  over  which  hospitals  exert  much  control 
and  one  which  likely  yields  prestige  to  both  the  administration  and  medical 
staff. 12     Therefore,  given  that  the  (insured)  market  places  little  constraint 
on  providers'  actions,  allowing  upward  rate  adjustments  for  increased  service 
quality  is  almost  certain  to  encourage  "Cadillac-ization"  of  care  well  beyond  the 
desired  level. 


B.     Program  Design  and  Variable  Selection


Given  the  above  assumptions  about  the  scope  and  objectives  of  the  selected 
reimbursement  scheme,  the  grouping  variables  that  are  appropriate,  in  general, 
depend  upon  the  control  parameters  of  the  program.     There  are,  however,  a  number 
of  cost-influencing  variables  identified  in  Section  II  as  exogenous  whose 
importance  is  independent  of  program  design. 

13 

The  first  in  this  set  of  variables  is  the  vector  of  input  prices.  Failing 
to  adjust  for  input  price  differences  would  penalize  the  institution  located 
in  an  area,  for  example,  of  high  wages  and  prices  in  a  way  that  a  market- 
determined  price  would  not.     That  is,  price  differences  reflecting  factor  price 


Grouping  hospitals  only  on  the  basis  of  the  demographic  characteristics 
of  the  market  in  which  they  are  located  has  a  number  of  problems.     First,  the 
hospital's  market  is  very  difficult  to  specify.     Frequently  it  does  not  con- 
form to  governmental  boundaries  (i.e.,   SMSA,  county,  etc.)  for  which  such  data 
are  readily  available,  especially  in  the  case  of  rural  hospitals  or  specialized 
facilities  that  draw  from  a  regional  or  sometimes  national  market.  Further, 
even  assuming  these  markets  could  be  identified,  adjusting  only  for  the  demo- 
graphic characteristics  of  the  population  implies  that  all  hospitals  in  an  area 
should  be  identical.     However,  given  the  indivisibilities  of  much  of  the  capital 
equipment  used  in  the  industry  it  is  likely  that  institutional  specialization 
is  desirable  and  would  lead  to  lower  system  costs. 

12 

See,  for  example,  Feldstein  (1971);  and  Newhouse  (1970). 

13 

If  more  convincing  evidence  in  support  of  the  Feldstein  proposition  that 
hospitals  influence  nursing  wages  is  found,  this  variable  must  be  excluded. 
differentials will be sustained in the market as long as the cost of transporting
the  output  (in  this  case  hospital  services)  is  not  zero.     However,  in  adjusting 
for  factor  price  differences,  care  must  be  taken  not  to  dilute  the  firm's 
incentive  to  respond  to  factor  price  changes  by  substituting  away  from  an  input 
whose  relative  price  has  risen.     For  example,   if  a  locality  suffers  an  out- 
migration  of  RN's  causing  the  wage  of  this  type  of  personnel  to  rise,  the 
institution  whose  reimbursement  rate  is  adjusted  to  exactly  offset  this  increase 
will  have  little  reason  to  replace  (where  possible)  some  of  its  RN's  with  LPN's 
and  other  relatively  less  expensive  personnel.         One  possible  means  of  handling 
this  problem  is  to  use  a  factor  price  index  as  the  adjustment  variable,  rather 
than  plugging  in  each  individual  input  price.     The  development  of  such  an  index, 
however,  requires  some  knowledge  of  the  hospital's  production  function  since  the 
appropriate  weight  of  each  factor  price  in  the  index  is  the  coefficient  of  the 
factor  in  the  production  relationship. 
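
A factor price index of the kind suggested above might look like the sketch below. It rests on an assumption the text does not make: that the production relationship is Cobb-Douglas, so that the factor coefficients (rescaled to sum to one) serve as exponents in a geometric index. The factor names, coefficients and prices are hypothetical.

```python
import math

def factor_price_index(local_prices, reference_prices, weights):
    """Geometric factor price index relative to a reference area.

    weights are assumed to be the factor coefficients of a Cobb-Douglas
    production relationship; they are rescaled here to sum to one.
    """
    total = sum(weights.values())
    log_index = sum(
        (weights[f] / total) * math.log(local_prices[f] / reference_prices[f])
        for f in weights
    )
    return math.exp(log_index)

# Hypothetical coefficients and prices (RN wage, LPN wage, capital cost).
weights = {"rn": 0.3, "lpn": 0.2, "capital": 0.5}
reference = {"rn": 10.0, "lpn": 6.0, "capital": 100.0}
local = {"rn": 12.0, "lpn": 6.0, "capital": 100.0}  # RN wage 20 percent higher

index = factor_price_index(local, reference, weights)
print(f"factor price index: {index:.4f}")
```

Because the weights are fixed rather than taken from the hospital's own staffing, the index rises by less than the RN wage itself, and a hospital that substitutes LPNs for RNs keeps the resulting savings.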

For  the  same  reasons,  local  regulations  and  perhaps  the  extent  of  unionization 
of  hospital  personnel  (which  relates  to  input  price  levels)  must  also  be 
recognized.     Local  regulations  which  restrict  hospital  operations  may  increase 
costs  by  preventing  the  hospital  from  choosing  the  least  costly      factor  mix 
(e.g.,  certificate  of  need  which  restricts  the  use  of  capital).     The  impact  of 
unionization  will  be  partially  reflected  in  wage  levels.     However,  additional 
costs  such  as  those  arising  from  strikes  (or  costly  negotiations  to  prevent 
strikes),  differences  in  fringe  benefits,  or  working  conditions  need  to  be 
recognized  explicitly  if  it  is  felt  that  hospitals  cannot  (or  should  not) 
affect  the  extent  of  unionization  of  their  employees.     Again,  no  windfall  losses 
or  gains  should  accrue  to  providers  because  of  this  set  of  factors  over  which 
they  exert  little  control. 

Hospital  costs  will  also  be  higher  in  rural  areas  where  demand  is  not  suffi- 
cient to  intensively  utilize  indivisible  pieces  of  hospital  capital.  Higher 
prices  to  cover  the  higher  costs  would  be  sustained  in  the  market  since,  for 
some  set  of  basic  services,  it  is  less  costly  for  consumers  to  support  an 
under-utilized  facility  than  to  travel  to  where  services  can  be  obtained  or 
to  do  without.     The  problem  of  encouraging  capital  accumulation  in  rural 
hospitals  beyond  the  optimal  level  dictated  by  consumer  preferences  is  minimized 
if  rates  are  set  based  on  comparative  rural  hospital  costs.  Intra-group 
collusion  to  "game"  the  system  is  not  likely  with  this  subset  of  providers. 
Thus,  a  variable  to  account  for  this  situation  should  be  included  in  the  set  of 
grouping  criteria. 


If  prospective  rates  are  paid  as  negotiated  with  no  end  of  the  year 
adjustment  to  reflect  differences  in  negotiated  and  actual  costs,  the  profit- 
maximizing  hospital  will  respond  appropriately  to  factor  price  changes  even  with 
built-in  rate  adjustments,   since  any  cost  savings  arising  from  such  actions  will 
increase  profits.     In  the  non-profit  sector,  this  will  happen  to  the  extent  the 
hospital can use the savings in salaries for a more desired purpose (e.g.,
investment). If costs can be sorted out in such a way that only costs for identical
products are being compared, variation should result only from factor price
differences. 
C.     Program  Design  Parameters

The  output  variables  reflect  differences  in  the  output  characteristics 
assumed  to  be  exogenous  in  the  discussion  at  the  beginning  of  this  section. 
However,  the  particular  variables  chosen  depend  upon  the  level  of  aggregation 
of  control  and  thus  the  program  design  parameters  selected.     Each  possibility 
will  be  discussed  separately.     In  all  cases  the  set  of  variables  discussed 
above  as  independent  of  program  design  should  also  be  included. 

C.1.     Per  Service  Reimbursement

If  providers  are  reimbursed  separately  for  each  service  rendered,  no  output 
adjustment  is  necessary  as  long  as  hours  of  nursing  service  are  billed 
separately  also.     This  is  true  regardless  of  whether  only  routine  costs  are 
included or whether all non-teaching costs are controlled.15 The only difference
is in the number of rates that are required in the two situations.

Per  service  reimbursement  allows  (encourages)  the  maximum  amount  of  provider- 
generated  demand.     Since  the  total  budget  increases  as  admissions,  length  of 
stay,  and/or  intensity  of  service  (tests  and  services  per  day)  increases,  the 
output-maximizing  hospital  has  the  incentive  to  increase  all  three  parameters. 
Quality-conscious institutions are encouraged to increase the number of procedures
performed per visit. Only the marginal profit-maximizing firm is
given  neutral  incentives  as  long  as  the  rate  of  each  service  is  set  equal  to 
efficient  production  cost.     Given  the  strong  influence  of  physicians  and 
hospitals  on  the  output  vector,  it  is  unlikely  that  this  payment  mechanism 
would  ever  lead  to  lower  hospital  costs  in  the  absence  of  very  strong  direct 
controls  on  output. 

C.2.     Per  Diem  Reimbursement 

C.2.a.     Routine  Costs 

The  only  appropriate  output  variables  in  the  case  of  the  more  aggregate  per 
diem  reimbursement  of  routine  costs  (the  current  HCFA  design)  are  a  measure  of 
case  mix  and  case  mix  severity  since  these  measures  are  likely  to  affect  the 
optimum  intensity  of  nursing  services.     Other  output  measures  are  inappropriate 
since  they  only  affect  costs  that  are  outside  the  purview  of  the  program. 

The  incentive  facing  hospitals  under  a  per  diem  rate  is  to  increase  length  of 
stay  since  later  days  generally  are  less  service  intensive  and  therefore  less 
costly.     That  hospitals  can  in  fact  respond  to  a  per  diem  control  by  keeping 
patients longer is implied by a study of a prospective payment plan in downstate
New York. Dowling et al. (1976) found some evidence that while per diem
costs  may  have  fallen  slightly  as  a  result  of  the  control  program,  the  increase 
in  average  patient  stay  probably  wiped  out  any  per  case  savings. 
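
The length-of-stay incentive can be illustrated with a small numerical sketch. The daily cost profile and the per diem rate below are hypothetical figures chosen only for illustration (they are not taken from the New York study), and the function name `stay_costs` is an assumption of this example.

```python
def stay_costs(daily_costs, length_of_stay):
    """Return (cost per case, average cost per day) for a stay of the given length."""
    days = daily_costs[:length_of_stay]
    cost_per_case = sum(days)
    return cost_per_case, cost_per_case / length_of_stay


# Hypothetical cost of each successive day of a stay: early days are
# service intensive, later days are not.
DAILY_COSTS = [300, 250, 200, 120, 100, 100, 100]
PER_DIEM_RATE = 200  # hypothetical prospective rate paid per patient day

for los in (4, 7):
    case_cost, per_diem_cost = stay_costs(DAILY_COSTS, los)
    margin = PER_DIEM_RATE * los - case_cost  # payments minus cost for the stay
    print(los, case_cost, round(per_diem_cost, 2), margin)
```

With these illustrative figures, extending the stay from four to seven days lowers average cost per day from 217.50 to about 167.14 while raising cost per case from 870 to 1,170, and it turns a loss of 70 on the stay into a surplus of 230. This is the pattern the text describes: per diem costs can fall while per case costs rise.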


15. Teaching costs will be discussed separately below.

Control is further eroded by including only routine costs in the reimbursement
program.     Hospitals  can  escape  the  constraint  by  increasing  the  number  of 
ancillary  services  and/or  artificially  allocating  some  of  the  costs  previously 
designated  as  "routine"  to  ancillary  accounts.     Given  the  state  of  the  art  in 
hospital  accounting,  such  practices  might  be  difficult  to  detect. 

C.2.b.  Total  Non-Teaching  Costs 

When  the  control  measure  encompasses  all  non-teaching  costs  of  the  institution, 
the  theoretical  considerations  of  Section  II  dictate  that  additional  variables 
to  capture  output  differences  are  appropriate.     In  this  situation,  variables 
reflecting  both  case  mix  and  case  mix  severity  should  be  included  in  the 
grouping  criteria  since  the  composition  of  output  determines  the  appropriate 
mix  of  ancillary  services,  and  severity  again  influences  both  nursing  intensity 
and  ancillary  services.     The  previous  comments  regarding  the  desirability  of  a 
per  diem  rate  are  applicable  here  as  well. 

C.3.     Per  Case  Reimbursement:     All-inclusive  Rates 

Since  average  cost  per  case  exactly  equals  average  daily  cost  multiplied  by 
average  length  of  stay,  differences  among  hospitals  producing  identical  case 
mixes  will  occur  only  as  there  exist  inter-institutional  differences  in  treatment 
patterns  or  in  length  of  stay.     Since  these  differences  are  appropriate  only  as 
dictated  by  the  case  mix  modifier  (severity),  including  both  case  mix  (where 
total costs are controlled) and case mix severity measures in the grouping
criteria eliminates the necessity for further grouping differences.

All-inclusive  per  case  reimbursement  of  hospitals  is  possibly  the  most  viable 
alternative.    While  this  control  mechanism  creates  an  incentive  to  increase 
the number of admissions (cases treated), the hospital probably has less control
over this variable than over service intensity or length of stay. As noted
below, a more aggregate unit of payment (e.g., total budget) requires an explicit
quantity of output adjustment to avoid arbitrary treatment of institutions
serving changing populations. Thus, increases in output (of a form determined by
the  type  of  adjustment  used)  may  also  be  encouraged  with  a  more  aggregate  control. 

A  further  incentive  created  by  this  mechanism  is  for  hospitals  to  attempt  to 
change  the  mix  of  output  that  is  produced.     This  is  a  problem  created  by  any 
payment  system  based  on  averages.     The  firm  is  induced  to  increase  production 
of  the  below  average  cost  output  and  decrease  production  of  the  above  average 
cost  output.     Thus,  hospitals  would  attempt  to  encourage  "easy"  admissions 
(e.g.,  those  whose  projected  cost  was  below  the  reimbursement  rate),  and  refer 
away  the  "difficult"  admissions  (e.g.,  those  whose  projected  cost  was  above 
the reimbursement rate). Anecdotal evidence suggests that this kind of substitution
is possible, though data have not been available to formally test its
magnitude. 
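
The substitution incentive created by an average-based rate can be sketched numerically. All figures below are hypothetical, and the function names `all_inclusive_rate` and `surplus` are assumptions of this example rather than part of any actual rate-setting procedure.

```python
def all_inclusive_rate(case_types):
    """Average cost per case over the projected output mix."""
    total_cost = sum(t["cost"] * t["admissions"] for t in case_types)
    total_admissions = sum(t["admissions"] for t in case_types)
    return total_cost / total_admissions


def surplus(case_types, rate):
    """Total payments minus total costs at a given per case rate."""
    return sum((rate - t["cost"]) * t["admissions"] for t in case_types)


# Hypothetical cost per case and projected admissions by case type.
projected = [
    {"type": "easy", "cost": 800, "admissions": 60},
    {"type": "difficult", "cost": 2000, "admissions": 40},
]
rate = all_inclusive_rate(projected)  # 1280.0

# The hospital shifts its mix toward the below-average-cost case type.
shifted = [
    {"type": "easy", "cost": 800, "admissions": 80},
    {"type": "difficult", "cost": 2000, "admissions": 20},
]

print(surplus(projected, rate))     # 0.0: break-even at the projected mix
print(surplus(shifted, rate))       # 24000.0: gain from substitution
print(all_inclusive_rate(shifted))  # 1040.0: rate if rebased on the new mix
```

At the projected mix the hospital breaks even by construction. Shifting twenty admissions from the "difficult" to the "easy" type yields a surplus at the unchanged rate, and an annual redetermination based on the previous year's mix would lower the rate and so limit the gain over time.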


16. The institution's ability to continue this substitution over time is
limited by an annual determination of the reimbursement rate based on projections
from the previous year's output mix for that hospital.

In addition to encouraging a substitution of one kind of output for another,
this  payment  mechanism  encourages  hospitals  to  alter  the  characteristics  of  any 
given type of output. That is, the institution will attempt to use fewer resources
to produce each admission, regardless of case type. While this is
precisely the intention of the control (since, with factor prices constant,
this is the only way costs can be reduced), it is argued that quality may suffer.
However,  given  the  bias  toward  high  quality  care  on  the  part  of  physicians  and 
the  direct  quality  controls  imposed  elsewhere  in  the  system  (e.g.,  PSRO's, 
accreditation  bodies,  etc.),  this  is  only  likely  to  be  a  serious  problem  where 
rates  are  established  well  below  current  cost  levels. 

It  seems  clear  at  this  point  that  the  payment  mechanism  by  itself  cannot  control 
all  possible  dimensions  of  hospital  operations.     Therefore,  an  effective  cost- 
containment  program  almost  certainly  must  embody  a  check  on  output  as  well  as 
a  per  unit  rate  ceiling.     Such  a  constraint  on  individual  procedures  or  days 
of  care  would  be  difficult  to  design  and  enforce  given  the  degree  of  endogeneity 
of  these  two  variables  as  well  as  the  fact  that  no  accepted  standard  exists  as 
to  the  number  of  procedures  or  days  of  care  that  are  "appropriate"  for  a  given 
diagnosis.     Thus,  a  constraint  aimed  at  either  of  these  variables  necessarily 
must  involve  statements  about  acceptable  medical  practice  and  therefore  run  the 
risk  of  being  nebulous  or  arbitrary.     A  constraint  on  total  admissions  has  two 
advantages:     it  need  involve  no  statements  about  acceptable  medical  practice, 
and  as  noted  above,  the  number  of  admissions  is  probably  less  subject  to 
manipulation by the hospital than are the other two variables.

C.4.     Per  Case  Reimbursement:     Diagnostic-Specific  Rates 

An  alternative  to  the  all  inclusive  per  case  payment  unit  (e.g.,  average  cost 
per  case  reimbursement)  is  the  determination  of  diagnostic-specific  per  case 
rates.     Thus,  instead  of  setting  one  rate  for  all  admissions  and  grouping  by 
the  exogenous  variables  (factor  prices,  etc.),  case  mix,  and  case  mix  severity, 
separate  rates  could  be  set  for  each  different  case-type  (diagnosis) .     In  this 
situation,  the  appropriate  grouping  variables  are  again  the  exogenous  factors 
(input prices, etc.) and case mix severity. If the diagnosis definition includes
a severity modifier (e.g., if acute appendicitis complicated by diabetes and
uncomplicated appendicitis are defined as different case types), only the
exogenous variables need appear in the set of grouping criteria.17

17. This is essentially the Worthington approach (see Worthington and Hixson, 1975).

An appealing feature of the diagnostic-specific per case payment system is
that the problem of output substitution noted above for all-inclusive case
rates is reduced or eliminated, depending upon the specificity of case-type
definitions and the accuracy of the case-specific rate (obviously, if the payment
rate for routine appendectomies is set well above its expected cost of production,
hospitals will attempt to do more appendectomies). The issue of quality reduction
as the characteristics of each type of output change remains.

Another advantage of this system is that no case mix variable need be included
in the set of grouping criteria since differences in case mix across hospitals
are accounted for explicitly in the number of payments of each type the
hospital receives. Thus, case mix data need not be collected by the regulatory
body for purposes of grouping. These data become available, however, as institutions
request payment, just as information regarding the number of days of
care provided is made available under current per diem systems.18 On the other
hand,  the  fact  that  hospitals  supply  this  information  in  the  form  of  payment 
requests  leaves  an  uncomfortable  degree  of  room  for  data  manipulation.     That  is, 
if  diagnosis  A  commands  a  higher  rate  than  diagnosis  B  (which  involves  basically 
the same body system and requires many of the same procedures), there is a large
incentive for the hospital to label all diagnosis B patients as patients with
diagnosis  A.     The  ability  of  the  regulatory  agency  to  constrain  this  type  of 
action  is  extremely  limited,  especially  given  that  there  is  often  legitimate 
disagreement  among  physicians  over  questions  of  appropriate  diagnosis. 
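
The size of the relabeling incentive can be sketched as follows. The two rates and the case counts are hypothetical, and `total_payment` is an illustrative helper name, not part of any actual payment system.

```python
def total_payment(rates, case_counts):
    """Sum of diagnostic-specific payments: rate times number of cases billed."""
    return sum(rates[diagnosis] * n for diagnosis, n in case_counts.items())


# Hypothetical per case rates for two closely related diagnoses.
rates = {"A": 1500, "B": 1000}

actual = {"A": 30, "B": 70}      # cases as they would be labeled without bias
relabeled = {"A": 100, "B": 0}   # every diagnosis B patient billed as diagnosis A

gain = total_payment(rates, relabeled) - total_payment(rates, actual)
print(total_payment(rates, actual), total_payment(rates, relabeled), gain)
# prints: 115000 150000 35000
```

The hospital's costs are unchanged in this sketch; the entire gain comes from the payment requests themselves, which is why the regulator's limited ability to audit diagnoses matters.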

A  further  problem  with  this  approach  is  that  separating  a  hospital's  total  costs 
into  case-specific  costs  with  any  degree  of  accuracy  is  likely  to  be  extremely 
difficult given the current accounting practices of hospitals. Thus, a
diagnostic-specific per case system, while conceptually very close to an all-inclusive per
case  control,  is  probably  a  less  desirable  approach  from  a  practical  standpoint. 

C.5.  Total  Budget  Reimbursement 
C.5.a.  Routine  Costs 

This  is  the  most  aggregate  payment  unit.    Where  a  total  budget  for  routine  costs 
is  determined  prospectively,  account  must  be  taken  of  the  quantity  of  output 
(i.e.,  days  of  care)  produced  as  well  as  severity,  as  before.     If  two  facilities 
have  identical  (severity  adjusted)  per  diem  or  per  case  routine  costs,  they  will 
still  have  different  total  yearly  general  service  costs  if  they  produce  different 
levels of output. Unfortunately, adjusting for quantity is difficult, since determining budgets
for hospitals grouped by quantity of output encourages the output maximizer to
expand even if all budgets are set at a break-even level. This distortion
would  have  to  be  checked  with  an  additional  constraint.     The  earlier  comments 
regarding  the  possibility  of  escaping  control  by  allocating  routine  costs  to 
unregulated  accounts  again  hold. 

C.5.b.     Total  Non-Teaching  Costs 

The  above  argument  holds  for  total  non-teaching  budget  reimbursement  with  the 
addition  of  a  case  mix  variable.  Again,  some  additional  constraint  is  called 
for  to  discourage  expansion  beyond  the  optimal  level  (See  Table  2.2). 


18. Pre-control baseline data, which would be extremely valuable in monitoring
output changes resulting from the payment program, would still have to be
collected directly.

TABLE 2.2

Program Design and Variable Selection

ALL SYSTEMS
    1. Input Prices
    2. Local Regulations
    3. Extent of Unionization
    4. Rural Markets

I.     PER SERVICE REIMBURSEMENT
    A. Routine Costs
        1. No Further Adjustment
    B. Total Costs
        1. No Further Adjustment
    Incentives:
        - Increase Admissions
        - Increase Length of Stay
        - Increase Intensity of Service

II.     PER DIEM REIMBURSEMENT
    A. Routine Costs
        1. Case Mix Severity
    B. Total Costs
        1. Case Mix Severity
        2. Case Mix
    Incentives:
        - Increase Admissions
        - Increase Length of Stay

III.     PER CASE REIMBURSEMENT: ALL INCLUSIVE RATES
    A. Routine Costs
        1. Case Mix Severity
    B. Total Costs
        1. Case Mix Severity
        2. Case Mix
    Incentives:
        - Increase Admissions

IV.     PER CASE REIMBURSEMENT: DIAGNOSTIC SPECIFIC RATES
    A. Routine Costs
        1. Case Mix Severity
    B. Total Costs
        1. Case Mix Severity
    Incentives:
        - Increase Admissions

V.     TOTAL BUDGET REIMBURSEMENT
    A. Routine Costs
        1. Case Mix Severity
        2. Quantity of Output
    B. Total Costs
        1. Case Mix Severity
        2. Case Mix
        3. Quantity of Output

D.  Teaching Costs

To  this  point,  no  explicit  mention  has  been  made  of  teaching  expenses.  Teaching 
is  an  aspect  that  has  often  been  identified  as  having  a  significant  impact  on 
costs  (Lave  and  Lave,  1970);  however,  little  study  has  been  directed  at  the 
explanation  of  this  observation.     If  teaching  facilities  serve  a  patient  group 
with  more  complex  conditions  (as  is  sometimes  argued)  this  would  be  accounted 
for  via  the  case  mix  and  severity  adjustments.     If  the  quality  of  care  is  higher 
in  these  hospitals,  this  aspect  would  also  be  addressed.     Thus,  only  the  issue 
of  costs  associated  directly  with  training  of  new  health  personnel  is  omitted. 

There  are  two  arguments  which  might  be  made  here.    The  first  is 
that  teaching  costs    might  be  accounted  for  in  terms  of  lump  sum  program  costs 
rather  than  a  percentage  increase  in  lab  costs,  routine  service  costs,  CCU 
costs,  etc.     That  is,  the  time  resources  required  of  the  chief  surgeon  to 
instruct  interns  in  his  (her)  department  are  appropriately  charged  to  teaching 
rather  than  to  surgery  since  it  is  a  separate  activity.19  Second,  since  the 
benefits of the teaching program accrue not just to the patients in that particular
institution but more generally to the population as a whole (as with
medical research), there is no reason to argue that only those served in the
teaching  hospital  should  pay  all  associated  teaching  costs.     A  more  appropriate 
funding  mechanism  might  be  based  on  general  tax  support,  either  from  state  funds 
or  federal  funds,  depending  upon  expectations  about  migration. 

Building  instructional  costs  into  the  rate  structure  not  only  raises  equity 
questions,  but  also  has  implications  about  optimal  program  size.     If  medical 
education  is  supported  through  lump  sum  grants,  the  amount  of  resources  devoted 
to  this  activity  is  directly  controlled  by  public  policy  at  the  appropriate 
level. When teaching costs are met with inpatient revenue, program size and
composition (in terms of specialty mix) are determined by individual hospitals,
and are influenced to some extent by demand for that institution's services.
Creating  a  rate  control  system  that  explicitly  allows  for  teaching  costs, 
especially  as  a  percentage  of  total  patient  costs,  encourages  the  expansion  of 
teaching  without  tying  its  growth  to  increases  in  demand  for  personnel.  However, 
it  is  possible  that  direct  payment  for  teaching  costs  may  not  be  feasible  either 
practically  or  politically  at  this  time.     If  this  is  the  case,  and  if  such 
additional  expenses  are  judged  appropriate  at  their  current  levels,  a  teaching 
variable  could  be  included  in  the  set  of  grouping  variables. 

E.  The  Problem  of  Weighting 

Once  the  appropriate  set  of  grouping  variables  has  been  identified,  there 
remains the question of relative weights. That is, if input price and (exogenously
determined) output differences are both "legitimate" sources of cost
variation among firms, which is more important in terms of its impact on cost?
If hospital A faces higher input prices, but hospital B produces a more complex
output, which hospital would be expected to have higher costs, assuming
homogeneity along all other dimensions including efficiency? The question can
also be asked of the various kinds of output (e.g., brain surgery vs. hernia
repair).

19. There is a practical problem with this approach since teaching costs
would be very difficult to identify given current hospital accounting systems.

There is no theoretical answer to the question. Empirically, it is possible
to  determine  the  relative  share  of  each  cost  component,  but  this  analysis  will 
only  produce  appropriate  shares  if  firms  are  producing  efficiently  in  the 
economic sense. That is, regressing cost on factor prices and output characteristics
will only yield coefficients that describe current hospital operations
rather than those that identify the efficient cost relationship. Further,
since it is possible (likely) that hospitals in different groups have different
cost  structures  (i.e.,  there  are  slope  as  well  as  intercept  differences  in 
their  respective  cost  functions),  appropriate  weights  might  differ  across  groups. 
Although  these  empirical  approximations  are  admittedly  imperfect,  they  may 
represent  the  most  acceptable  alternative.     Another  alternative,  of  course,  is 
to  weight  each  of  the  variables  equally. 
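
A minimal sketch of the empirical alternative is given below. It regresses cost on one factor price and one output characteristic, then converts the coefficients into relative weights. The data are synthetic, the function names are hypothetical, and scaling each coefficient by the standard deviation of its variable is only one possible convention adopted here for illustration; it is not the weighting method of this study. As noted above, such coefficients describe current operations, not the efficient cost relationship.

```python
from statistics import pstdev


def ols(columns, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented system (X'X | X'y).
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for r in range(k):
            if r != i:
                f = A[r][i]
                A[r] = [vr - f * vi for vr, vi in zip(A[r], A[i])]
    return [A[i][k] for i in range(k)]  # [intercept, slope_1, slope_2, ...]


def relative_weights(slopes, columns):
    """Share of each variable in the sum of |slope| times standard deviation."""
    contributions = [abs(b) * pstdev(col) for b, col in zip(slopes, columns)]
    total = sum(contributions)
    return [c / total for c in contributions]


# Synthetic data for six hospitals (illustrative only).
wage = [4.0, 5.0, 6.0, 4.5, 5.5, 6.5]        # factor price
complexity = [1.0, 3.0, 2.0, 4.0, 1.5, 2.5]  # output characteristic
cost = [50 + 20 * w + 10 * c for w, c in zip(wage, complexity)]

coefs = ols([wage, complexity], cost)
weights = relative_weights(coefs[1:], [wage, complexity])
print([round(b, 6) for b in coefs], [round(w, 3) for w in weights])
```

Because the synthetic costs are an exact linear function of the two variables, the regression recovers the intercept and slopes used to generate them. With real data the fit would be inexact, and the weights could differ across hospital groups, as the text cautions.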

F.     Partial  Coverage 

Regardless  of  the  set  of  variables  for  which  rate  adjustments  are  made,  the 
program  will  impact  differently  on  the  industry  depending  upon  the  subgroup  of 
patients  to  whom  the  program  pertains.     If  only  costs  associated  with  Medicare 
patients  are  controlled,  the  net  impact  on  total  institutional  cost  is  likely 
to  be  small.     Further,  the  hospital  has  at  least  two  options  for  circumventing 
the  system.     The  first  is  to  allocate  costs  that  are  not  covered  by  Medicare 
reimbursement  to  other  patient  groups,  either  directly  to  justify  higher  cost- 
based  reimbursement  or  by  charging  prices  in  excess  of  costs  to  make  up  the 
difference. Medicare costs are still lowered in this situation, but this
decrease  will  be  more  than  offset  by  increases  in  non-Medicare  costs  if  the 
juggling  requires  any  administrative  resources.     There  are  also  problems  of 
equity  since  non-Medicare  patients  are  thus  required  to  partially  subsidize 
the  services  of  the  older  group. 

The  second  option  open  to  the  hospital  is  to  refuse  to  admit  Medicare  patients. 
If  this  option  were  exercised,  the  impact  could  range  from  inconvenience  to 
patients  whose  admissions  were  delayed  because  of  transfers  to  a  reversal  of 
Medicare's  role  in  access  to  care. 

Slippage  of  this  sort  can  be  somewhat  mitigated  by  building  positive  as  well 
as  negative  incentives  into  the  system.     If  rates  are  set  as  group  averages  with 
providers  grouped  according  to  the  variables  identified  above,  hospitals  whose 
costs  fall  below  the  mean  will  make  a  profit  on  each  Medicare  admission.  Thus, 
over  time,  Medicare  patients  may  be  shifted  from  high  cost  (above  the  mean) 
hospitals  to  low  cost  hospitals.     If  this  is  a  pure  substitution  of  one  group 
of  patients  for  another,  total  system  costs  will  be  unaffected.  Medicare 
costs  will  be  lowered  only  to  the  extent  high  cost  hospitals  drop  Medicare 
participation  completely  so  that  their  average  costs  are  no  longer  calculated 
into  the  mean  for  purposes  of  rate  determination. 

20. The extent to which this is feasible depends on the method of payment
for non-Medicare services and the size of the Medicare group relative to the
total patient load.

IV.     Variable Measures

The previous sections have identified the set of variables which, from a
conceptual viewpoint, should form the basis of a hospital classification system.
However, to translate the conceptual notion into a workable empirical framework
for the purpose of implementation, it is necessary to identify specific measures
by which the variables can be represented.

Ideally,  the  measures  chosen  are  exact  empirical  reflections  of  the  variables: 
a  family  purchasing  power  variable  translates  into  median  family  income  for 
the  calendar  year  under  consideration,  including  wages,  interest,  dividends, 
and  other  factor  payments;  transfer  payments  (e.g.,  social  security,  welfare, 
etc.);  income  in  kind  (including  imputed  rental  value  of  housing  and  employer 
contributions  to  fringe  benefits);  minus  tax  liabilities  at  all  levels.  Usually 
this  translation  of  variables  into  their  measures  does  not  contain  the  complete 
conceptual  thrust  of  the  original  variables.     Data  restrictions  to  a  large 
degree  constrain  the  set  of  measures  available.     In  the  previous  example,  data 
on  in-kind  income  (especially  imputed  figures)  are  not  generally  available. 
Further,  the  ideal  measure  is  often  not  fully  specified  by  the  conceptual  model. 
That  is,  should  windfall  gains  that  accrue  during  the  study  year  be  included 
in  the  previous  example  since  they  add  to  consumers'  purchasing  power,  or  should 
they  be  omitted  on  the  assumption  that  unexpected  income  is  quickly  tucked  into 
savings and ignored by the consumer for the purposes of current decisions?
Questions such as these are answerable only empirically, since a sound conceptual
model could be developed to predict either outcome. Therefore, it is not always
possible  to  derive  a  single  set  of  ideal  measures  to  reflect  the  appropriate 
variables  even  without  data  restrictions. 

The  conceptual  framework  identified  six  variables  as  sources  of  expected  cost 
variation  among  efficient  institutions,  based  upon  the  application  of  market 
price  theory  to  the  hospital  setting:     factor  prices,  unionization,  external 
regulation, rural scale diseconomies, case mix composition, and case mix
severity. The remainder of this section will discuss how each of these variables
might ideally be represented empirically, and given the data restrictions of the
present study, how they will be represented for purposes of this analysis. This
information is summarized in Table 2.3. Finally, some attempt will be made to
assess the expected impact on the resulting classification due to the
substitution of available measures for ideal measures.


21. Normally, median family income is used instead of average family income
so that the extreme values of income may not distort the statistics.

TABLE 2.3

Empirical Measures for Conceptual Variables

1.  INPUT PRICES
    A. Hospital Wages
    B. Manufacturing Hourly Wage
    C. Transportation and Public Utilities Hourly Wage
    D. Retail Hourly Wage

2.  LOCAL REGULATIONS
    A. No Measures Available

3.  EXTENT OF UNIONIZATION
    A. No Measures Available

4.  RURAL MARKETS
    A. Uniform Pressure Occupancy Index

5a. CASE MIX: ENDOGENOUS APPROACH
    A. Number of Basic Services
    B. Number of Quality Enhancing Services
    C. Number of Complex Services
    D. Number of Community Services
    E. Number of Births/Number of Discharges
    F. Number of Surgical Operations/Number of Discharges
    G. Number of Outpatient Visits/Number of Discharges

5b. CASE MIX: EXOGENOUS APPROACH
    A. Median Family Income
    B. Percentage of Population Female and Aged 15-44 Years
    C. Percentage of Population Aged 0-5 Years
    D. Percentage of Families Earning Income Less Than $4,000
    E. Labor Force Participation Rate Age 16 and Over
    F. OB-GYN's per 10,000 Population
    G. Primary Care M.D.'s per 10,000 Population
    H. Percentage of Population Non-White
    I. Disability Rate Age 16-64 Years
    J. Percentage of Population Aged 65 Years and Over
    K. Percentage of M.D.'s Aged 60 and Over
    L. Medical Specialists per 10,000 Population
    M. Other Direct Care Specialists per 10,000 Population
    N. Other Specialists per 10,000 Population
    O. Surgical Specialists per 10,000 Population

6a. CASE MIX SEVERITY: ENDOGENOUS APPROACH
    A. Percentage of Population Aged 0-5 Years
    B. Percentage of Families Earning Income Less Than $4,000
    C. Percentage of Population Aged 65 Years and Over

6b. CASE MIX SEVERITY: EXOGENOUS APPROACH
    A. No Further Measures Necessary

A.     Factor Prices

The factor price variable is intended to pick up exogenous differences in the
prices that hospitals in different input market areas must pay for these inputs.
The ideal measure of this variable is sensitive not only to absolute price
level differences, but also to the impact of selective differences. That is,
a ten percent differential in the price of labor between two areas will have a
larger effect on hospital cost than will a ten percent difference in the price
of laundry detergent. Further, it must be recognized that the extent to which
input price differences will be reflected in final costs will depend upon the
substitutability of inputs in the production process. That is, if labor
initially accounts for 50 percent of total expenses per unit of output, a
10 percent rise in the price of labor will result in a full 5 percent increase
in the cost of a marginal unit only if other inputs cannot be substituted (at least
to some extent) for the now more expensive input. Thus, the ideal measure of
the impact of factor price differences on output cost is an index formed as
the weighted sum of relevant factor prices where weights reflect the input's
coefficient in the production function.22 Since the data to construct such
an ideal measure are not available in the current study, a much less exact
measure will be used. Factor prices will be represented as a vector with four
elements: a computed average HOSPITAL WAGE, MANUFACTURING HOURLY WAGE,
TRANSPORTATION AND PUBLIC UTILITIES HOURLY WAGE, and RETAIL TRADE HOURLY WAGE. The
first, average hospital wage, will be constructed as the endogenous measure,
PAYROLL FOR ALL OTHER PERSONNEL, divided by the sum of PERSONNEL - FULL TIME ALL OTHER
REGISTERED NURSES, PERSONNEL - FULL TIME LPN's, and PERSONNEL - FULL TIME ALL OTHER
(collected from the AHA data). Because this measure uses the hospital's own
payments per employee it is only useful as an historic measure. That is, in
an on-going system, the use of such an endogenous variable might encourage
hospitals to increase their payroll in the hopes of moving themselves to a more
remunerative group. The other three wage categories are used as proxies for the
general cost of living in the area which primarily reflects prices of non-labor
inputs (e.g., food, etc.).

The bias resulting from the use of this measure rather than the ideal
counterpart is uncertain in direction and magnitude.

22. If there are two inputs, L and K (with prices w and c, respectively),
used to produce some output, Q, then total cost is given by:

$$TC = wL + cK$$

Differentiating with respect to Q gives:

$$MC = \frac{\partial TC}{\partial Q} = w \frac{\partial L}{\partial Q} + c \frac{\partial K}{\partial Q}$$

The change in the cost of a marginal unit of output as one factor price, w,
changes by one unit is:

$$\frac{\partial MC}{\partial w} = \frac{\partial L}{\partial Q} \qquad \left(\text{similarly, } \frac{\partial MC}{\partial c} = \frac{\partial K}{\partial Q}\right)$$

Thus, the impact on marginal cost of finite changes in all input prices is:

$$dMC = dw \, \frac{\partial L}{\partial Q} + dc \, \frac{\partial K}{\partial Q}$$

But $\partial L / \partial Q$ and $\partial K / \partial Q$ are simply the inverses of the marginal products of those inputs,
or the inverses of their coefficients in the production function.

23. In addition, the measure only captures a subset of factor prices and
will be influenced by differences in the mix of labor in the categories used
as well as differences in the wages of these markets.
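
The two calculations in this subsection can be sketched in a few lines. The first function reproduces the fixed-proportions arithmetic of the labor example (a 50 percent cost share and a 10 percent price rise); the second computes an average hospital wage as payroll divided by the sum of full-time personnel counts. The function names, the dictionary keys, and the payroll and staffing figures are all hypothetical; they are not AHA field names or study data.

```python
def unit_cost_change(cost_shares, price_changes):
    """First-order proportional change in unit cost when input proportions
    are fixed (no substitution): sum of cost share times price change."""
    return sum(share * price_changes.get(name, 0.0)
               for name, share in cost_shares.items())


def average_hospital_wage(payroll, full_time_counts):
    """Payroll divided by the sum of full-time personnel counts."""
    return payroll / sum(full_time_counts)


# Labor accounts for 50 percent of unit cost; its price rises 10 percent.
shares = {"labor": 0.5, "other": 0.5}
print(unit_cost_change(shares, {"labor": 0.10}))       # 0.05

# Hypothetical payroll and full-time counts for three personnel categories.
print(average_hospital_wage(1_200_000, [40, 20, 60]))  # 10000.0
```

The 5 percent figure is an upper bound on the cost increase: where other inputs can be substituted for labor, the realized increase is smaller, which is why the ideal index would weight each price by the input's role in the production function rather than by its current cost share alone.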

B.  Unionization 

The  purpose  of  this  variable  is  to  capture  the  cost  impact  of  unionization 
that  is  not  reflected  in  wage  levels.     That  is,  management  may  incur  additional 
operating  costs  because  of  the  presence  or  threatened  presence  of  labor  unions: 
higher  fringe  benefits,  costs  associated  with  union  bargaining,  etc.     The  ideal 
measure  of  this  variable  includes  the  percentage  of  an  institution's  employees 
that  are  unionized  (to  capture  the  extent  of  power  of  the  unions),  the  number 
of  unions  represented  (as  a  measure  of  the  transaction  costs  involved  with  union 
dealings), and some measure of the threat of unionization, perhaps the percentage
of service employees in the county who are unionized.

In  the  present  study,  no  data  regarding  unionization  were  available.     It  is 
unlikely  that  a  serious  distortion  is  caused  by  this  omission. 

C.  External  Regulation 

The impact of external regulation on costs is not dissimilar in nature to
that of unionization. Regulation restricts management decision-making, thereby
increasing operating costs (unless the regulation is not binding). Further,
there are costs associated with dealing with the regulatory agency (i.e.,
filling out forms, appearing at hearings, etc.). As with unionization, the
ideal measure of this variable captures both the direct costs of the restrictions
and the "haggle" cost. The latter might be measured by the number of
different agencies with which the hospital had to deal (although this is an
imperfect measure since the extent of the administrative requirements of
different agencies is likely to vary substantially). The former is more difficult
to measure. Ideally, some attempt would be made to assess the total cost
impact of various restrictions. For example, certificate of need legislation
may induce the hospital to substitute labor or uncontrolled equipment for
additional beds, resulting in higher average costs.^24 On the other hand,
capital controls may act to limit entry (and therefore competition), thus
decreasing the amount of resources that must be spent attracting physicians
from competing institutions. Further, it must be noted that such cost assessments
must be made from an empirical rather than a theoretical perspective,
since the implementation of a law often differs substantially from its
legislative intent.

Again,  no  regulation  data  are  available  in  the  present  study.  The  degree  of 
bias  resulting  from  this  distortion  cannot  be  known  without  some  idea  of  the 
magnitude  of  costs  arising  from  regulatory  restrictions,  and  the  variability 
of  regulations  across  hospitals. 


24

A study done by Salkever and Bice (1976) lends some empirical support
to this notion.

D. Rural Markets Variable

While  cost  differences  arising  from  economies  of  scale  are  not  generally 
characterized  as  justifiable  for  reimbursement  purposes,  in  the  special  case  of 
rural hospitals an exception is made. The argument has been made that rural
communities  may  not  be  able  to  support  a  set  of  basic  facilities  at  optimal 
capacity.     The  resulting  higher  average  costs  in  this  case  are  justifiable, 
however,  since  it  may  be  less  costly  for  the  community  to  finance  this  excess 
capacity  than  to  seek  basic  services    elsewhere  or  to  do  without  them. 

While  the  general  conceptual  notion  is  clear,  its  translation  into  an  empirical 
measure  is  much  less  so.     The  variable  is  attempting  to  allow  for  the  impact 
of desired (by the community) excess capacity on costs. In a competitive situation,
the appropriate adjustment would be found in the market as the amount
consumers  are  willing  to  pay  over  what  they  would  pay  in  a  larger  neighboring 
community  (this  is  obviously  influenced  by  the  costs  of  getting  to  the  larger 
town).     Within  the  confines  of  a  reimbursement  system,  however,  the  accurate 
measurement  of  this  variable  is  necessarily  imperfect. 

The rural markets variable proposed for this study, referred to as the uniform
pressure occupancy index (UPOI), is based upon the concept of uniform
probability of overflow suggested by Rosenthal (1964). In brief, Rosenthal's
approach computes those values of average daily census for each geographical
area such that the probability of hospitals being filled to capacity will be
the same irrespective of their size and location. Rosenthal's measure, however,
is based on the average size of hospitals in a county, an assumption we tested
by means of regression analysis, using county population, county density, and
a dummy variable indicating SMSA/non-SMSA as independent variables. Two regression
equations, one using the total number of county hospital beds and the other
using the average number of county hospital beds as the dependent variable, were
estimated from our complete sample of 1,070 hospitals. The results are presented
in Table 2.4; from this table it is evident that using total county hospital
beds is significantly preferred as an urban/rural indicator. Therefore, our
development of the rural variable is based on the total number of hospital
beds in a county (B) and differs from Rosenthal's measure in this respect.

The  computations  are  also  based  upon  the  additional  assumptions: 

1. The probability of overflow for any hospital is constant for any given
day.

2. The daily census is Poisson distributed.

3. The probability of overflow cannot exceed 0.01 (i.e., overflow
will not occur more than one day in 100).

The uniform pressure occupancy index is calculated by finding the value of the
average daily census (ADC) for each county such that the probability of
demand exceeding the total number of beds in that county is less than 0.01,
and dividing ADC by the total number of beds in the county (B). For example,
if the total number of county beds is 75, then, by the Poisson assumption,
the following formula will compute ADC such that 75 total beds will meet the demand
99 percent of the time:

\[ 0.99 = \sum_{x=0}^{75} \frac{e^{-ADC}\,(ADC)^{x}}{x!} \tag{2.1} \]

Using a cumulative Poisson table to solve the above equation, one finds that
ADC = 55, and, by definition,

\[ \text{Uniform Pressure Occupancy Index (UPOI)} = \frac{ADC}{B} = \frac{55}{75} = 0.733. \]
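
Where a cumulative Poisson table is not at hand, equation (2.1) can also be solved numerically. The following is a minimal sketch, assuming a simple bisection search; the function names are illustrative and are not part of the study. The value it returns need not coincide with a figure read from a printed table, so it is best treated as a cross-check rather than a replacement for the tabulated value.

```python
import math


def poisson_cdf(k, mean):
    """Return P(X <= k) for a Poisson variable with the given mean."""
    term = math.exp(-mean)      # P(X = 0)
    total = term
    for x in range(1, k + 1):
        term *= mean / x        # P(X = x) obtained from P(X = x - 1)
        total += term
    return total


def solve_adc(beds, coverage=0.99, tol=1e-9):
    """Find ADC such that P(demand <= beds) equals `coverage`, by bisection."""
    lo, hi = 1e-9, float(beds)  # the CDF at k = beds falls as the mean rises
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if poisson_cdf(beds, mid) > coverage:
            lo = mid            # coverage still exceeded: the mean can be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


beds = 75
adc = solve_adc(beds)
upoi = adc / beds
print(round(adc, 2), round(upoi, 3))
```

The direct summation is adequate for bed counts of this size; for counties with many hundreds of beds the normal approximation discussed next is the more practical route.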

While the calculation of UPOI is theoretically straightforward, a major
obstacle is provided by the fact that cumulative Poisson tables do not readily
exist for values of ADC > 30. However, using the fact that the Poisson distribution
approaches the normal distribution for values of ADC > 30, equation (2.1)
can be rewritten as:

\[ \lim_{ADC \to \infty} \sum_{x=0}^{B} \frac{e^{-ADC}\,(ADC)^{x}}{x!} = \int_{-\infty}^{B} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x - ADC)^{2}}{2\sigma^{2}} \right] dx = 0.99 \tag{2.2} \]


where $\sigma = \sqrt{ADC}$. (Equation (2.2) states that x is normally distributed with a
mean equal to ADC and standard deviation equal to $\sqrt{ADC}$.) To find the value
of B such that the area under the normal curve from $-\infty$ to B is equal to 0.99,
let Z be a standard normal variate, such that

\[ \Pr\left[ Z \le \frac{B - ADC}{\sqrt{ADC}} \right] = 0.99. \]

The value of Z is then found to equal K from a table of standard normal values
(where K = 2.33 for a probability of 0.99). In general,


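\[ \frac{B - ADC}{\sqrt{ADC}} = K \;\Rightarrow\; ADC = \left[ \frac{-K + \sqrt{K^{2} + 4B}}{2} \right]^{2} \]

and

\[ UPOI = \frac{ADC}{B} = \frac{K^{2} - 2K\sqrt{K^{2} + 4B} + K^{2} + 4B}{4B}. \]

Solving for UPOI, one finds

\[ UPOI = 1 + \frac{K^{2}}{2B} - \frac{K}{2B}\sqrt{K^{2} + 4B} = 1 + \frac{K^{2}}{2B} - \sqrt{\frac{K^{4}}{4B^{2}} + \frac{K^{2}}{B}}. \]

Since $K^{2}/4B \approx 0$ for typical values of K (<3) and B (>40),

\[ \sqrt{\frac{K^{4}}{4B^{2}} + \frac{K^{2}}{B}} = \frac{K}{\sqrt{B}} \sqrt{1 + \frac{K^{2}}{4B}} \approx \frac{K}{\sqrt{B}} \]

and the expression for UPOI can be written as follows:

\[ UPOI \approx 1 + \frac{K^{2}}{2B} - \frac{K}{\sqrt{B}} \approx 1 - \frac{K}{\sqrt{B}} \tag{2.3} \]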


In the above example where ADC = 55, it was found that UPOI = 0.733 using
the Poisson tables; using (2.3), $UPOI = 1 - 2.33/\sqrt{75} = 0.731$. For K = 2.33
(i.e., the probability of 0.99), typical values of the UPOI calculated from
(2.3) are shown in Figure 2.1.
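
The normal-approximation formulas can be evaluated directly for any county bed count. The sketch below is illustrative only (the function names are assumptions): it computes UPOI both from the exact root of (B - ADC)/sqrt(ADC) = K and from the simplified form (2.3), so that the two can be compared over a range of B.

```python
import math


def upoi_normal_root(beds, k=2.33):
    """UPOI from the exact root of (B - ADC) / sqrt(ADC) = K."""
    sqrt_adc = (-k + math.sqrt(k * k + 4.0 * beds)) / 2.0
    return sqrt_adc ** 2 / beds


def upoi_simplified(beds, k=2.33):
    """UPOI from the simplified form (2.3): 1 - K / sqrt(B)."""
    return 1.0 - k / math.sqrt(beds)


# Hypothetical bed counts chosen only to show how the index behaves.
for beds in (50, 75, 150, 300, 1000):
    print(beds, round(upoi_normal_root(beds), 3), round(upoi_simplified(beds), 3))
```

For B = 75 the simplified form reproduces the 0.731 quoted in the text. The two forms can differ in the second decimal place for small B, because the simplified form drops terms of order K²/B, and they converge as B grows.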


TABLE 2.4

Regression Analysis: Rural Markets Variable


Regression 1: Dependent Variable - Total Number of County Hospital Beds

Independent Variable     Coefficient      Cumulative R²
(Constant)               31.2
Population               .0043            .871
Density                  .099             .892
SMSA                     78.35            .892

Overall F: 1703.5        Overall Significance: 0.0


Regression 2: Dependent Variable - Average Number of County Hospital Beds

Independent Variable     Coefficient      Cumulative R²
(Constant)               104.2
Population               .625 x 10^-5     .079
Density                  .0039            .094
SMSA                     104.67           .299

Overall F: 8.1           Overall Significance: 0.0


FIGURE 2.1

Uniform Pressure Occupancy Index (UPOI)


E. Case Mix Composition


The ideal measure of this variable captures the range of product mix in a
hospital. Since it is believed that product mix differences lead to average
cost differences, the description of case mix composition must be detailed
enough to differentiate between two product types whose production costs differ.
Ideally, every different product would be identified even if at present their
cost differences are small, since over time changes in technology might convert
these cost differences to ones of greater magnitude. Thus, case mix aggregation
along departmental or disease code lines may not be appropriate if the product
specification is expected to be useful in the long run.^25

Many studies have used a scope of services index as a proxy for case mix.^26
The idea is that a hospital, for example, can only treat surgical patients if
it has an operating room. While this fact is obviously true, the reverse
correlation is not as strong, especially in a non-profit oriented industry.
That is, the presence of a coronary care unit of a given size in two different
hospitals does not imply that the CCU occupancy rate is the same in both institutions,
or that the number of cases treated in that unit as a percentage of
total admissions is the same. Even the expressed desire to add a CCU to an
existing structure need not, in a non-profit firm, mean that community demand
for such a service is high.

More importantly, even if a scope of services index is an acceptable proxy for
case mix in a retrospective study of hospital costs,^27 it is definitely not a
good case mix substitute in a rate setting system. If the program allows hospitals
with a more complex or wider range of services to be placed in groups
with higher rates, the clear incentive for the administrator is to add services
regardless of how intensively they might be used. The result over time would
be a proliferation of underutilized facilities with substantial implications
for both cost and quality of care^28 (see Glasgow et al., 1976). The correlation
between case mix and the services index, even if perfect in the initial
period, would be small in future periods. Further, since data regarding scope
of services (beyond the very simplistic AHA annual questionnaire data) would
also have to be collected if it were to be used as a grouping variable, it
would seem much more prudent to go after the real variable, case mix, rather
than its proxy.


25

Obviously there are practical tradeoffs. Empirical investigations might
indicate that aggregation to some level loses less in terms of accuracy than
it might save in terms of additional data collection costs.

26

See, for example, Berry (1974).

27

Some on-going research at the University of Washington indicates that
even retrospectively a fairly detailed scope of services index correlates poorly
with case mix (the correlation coefficient is 0.03). This result must be interpreted
with caution, as no adjustments have been made for case mix severity.

28

Empirical evidence suggests that outside controls on capital expansion
have been less than effective. See David Salkever and Tom Bice (1975).

E.1. Exogenous Approach

For the present study, no explicit case mix data at any level of detail are
available. Therefore, case mix composition will be estimated from two opposite
approaches. The first, assuming a fixed set of hospital facilities (supply) in
the short run, treats observed case mix as a function of exogenous and/or
predetermined demand variables, such as population age, income, insurance coverage,
etc. The problem with this approach is that while groups are being formed with
individual institutions, demand variables are available only on a county-wide
basis. The result of that data constraint is to interject a strong geographic
bias into the system. That is, all hospitals located in the same county will
have identical values for all of the grouping variables and will thus be grouped
together. What is true, however, is that a hospital's market area is not
necessarily equivalent to county boundaries. Specialty hospitals may draw from
a regional or even national market (although a subset of that population) while
local general hospitals may serve less than the entire county. Unfortunately,
these catchment areas are difficult to define and appropriate data corresponding
to their boundaries are not available.

The specific measures to be used in this approach are the following:

MEDIAN FAMILY INCOME

(CENSUS 1970 FEMALE POPULATION AGE 15-24 + CENSUS 1970 FEMALE POPULATION
AGE 25-35 + CENSUS 1970 FEMALE POPULATION AGE 35-44) ÷ POPULATION SIZE

PERCENTAGE OF POPULATION AGE 0-5 1970

PERCENTAGE OF FAMILIES, INCOME LESS THAN $4,000

LABOR FORCE PARTICIPATION RATE AGE 16+

OB-GYN's PER 10,000 POPULATION

PRIMARY CARE MD's PER 10,000 POPULATION

PERCENTAGE OF POPULATION NON-WHITE

DISABILITY RATE AGES 16-64 (%)

(PERCENTAGE OF POPULATION AGE 70 AND OVER + PERCENTAGE OF POPULATION
AGE 65-69)

PERCENTAGE OF POPULATION AGE 60-64

MEDICAL SPECIALISTS PER 10,000 POPULATION

OTHER DIRECT CARE SPECIALISTS PER 10,000 POPULATION

OTHER SPECIALISTS PER 10,000 POPULATION

SURGICAL SPECIALISTS PER 10,000 POPULATION

E.2. Endogenous Approach

Whereas  the  first  approach  focused  exclusively  on  measures  exogenous  to  the 
individual  hospital,  the  second  approach  considers  only  endogenous  measures. 
This  approach  assumes  that  case  mix  composition  is  accurately  reflected  by 
the  facilities  and  services  available  in  the  institution.     As  noted  in  earlier 
sections,  this  assumption  is  possibly  not  justified  in  the  short  run  and  almost 
certainly  not  justified  in  the  long  run.     In  essence,  this  approach  allows  too 
much  distinction  among  hospitals  whereas  the  first  approach  allowed  too  little. 

What is desired in this approach is to combine the available information on
presence or absence of various facilities and services in such a way as to
provide a meaningful description of the specialized asset composition of the
institution. Using 46 dummy variables (corresponding to the 46 facilities
and services for which data are available, variables 124-169) is neither practical
nor particularly enlightening. However, the literature provides no consensus
on the appropriate method of combining the information into fewer, more useful
measures (e.g., a facilities index). Here again the issue of variable weighting
arises. Since different facilities, in general, have different implications in
terms of their impact on institutional cost, it would be inappropriate to
weight them equally in a system designed to focus on cost homogeneity. This
has, however, been the approach taken in the literature, so that no tested
weighting scheme is available for facilities and services. Therefore, it is
proposed here to use a modification of the approach suggested by Ralph Berry
(1973). Berry observes that, empirically, hospitals can be classified into
five groups based on the subset of facilities and services offered. In the
present study, the endogenous measure of case mix for each hospital will be
given as the number of facilities and services it has in each of Berry's four
categories.^29 Thus, the measure is a vector with four elements.
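
The four-element vector described above can be sketched as a simple count per category. The category names follow the Berry groups named in this chapter, but the facilities listed under each category below are purely hypothetical placeholders, not Berry's actual assignments, and the function name is an assumption made for illustration.

```python
# Hypothetical assignment of facilities to four of Berry's categories.
# The facility names and their placement are illustrative assumptions only.
BERRY_CATEGORIES = {
    "basic": {"clinical_laboratory", "emergency_room", "operating_room"},
    "quality_enhancing": {"pharmacy", "blood_bank"},
    "complex": {"intensive_care_unit", "open_heart_surgery"},
    "community": {"home_care_program", "social_work_department"},
}
CATEGORY_ORDER = ("basic", "quality_enhancing", "complex", "community")


def case_mix_vector(facilities):
    """Count a hospital's facilities in each category, in CATEGORY_ORDER."""
    present = set(facilities)
    return [len(present & BERRY_CATEGORIES[name]) for name in CATEGORY_ORDER]


# A made-up hospital reporting four facilities.
hospital = ["emergency_room", "operating_room", "pharmacy", "intensive_care_unit"]
vector = case_mix_vector(hospital)
print(vector)
```

Because each element is an unweighted count, the sketch shares the limitation noted in the text: every facility within a category contributes equally regardless of its cost implications.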

Again, it is important to note that neither of the approaches used here is
appropriate  for  use  in  an  on-going  control  system.  They  are  used  here  only 
for  reasons  of  data  availability. 

F.  Case  Mix  Severity 

The ideal measure of this variable is something of an empirical question.
That is, there is no conceptual argument pointing to the use of age, pre-existing
condition, income level, etc. as a measure of case mix severity. The proper
measures are identified through the continual observation that (for example)
older or diabetic patients require more resources to achieve the same outcome
for any given condition, holding constant other variables (e.g., price, insurance,
etc.) that might affect resource utilization. There are two difficulties. One
is that some severity modifiers (e.g., obesity) might be appropriate only for
some conditions (e.g., abdominal surgery) and not for others (e.g., broken arm).
The second is that since the modifiers must be identified empirically, it may
be difficult to distinguish between "true" measures of severity and taste
variables.


29

Only the first four of the five Berry groups (Basic, Quality Enhancing,
Complex, Community, and Special) will be considered, since the last group, termed
"special," contains services such as chapel, hospital auxiliary, chaplaincy, etc.,
which will have almost no effect on the case mix of the hospital.

Lacking any conclusive evidence regarding the appropriate set of severity
modifiers and recognizing data constraints, the present study will use the
following measures:

PERCENTAGE OF POPULATION 0-5 1970

PERCENTAGE OF FAMILIES EARNING LESS THAN $4,000

PERCENTAGE OF POPULATION AGE 70+

Since these measurements are used as predictors of case mix composition in
the exogenous approach outlined above, they will not be repeated in that approach.

V.  Evaluation 

It would be desirable if the grouping system, once constructed, could be
evaluated as to its "goodness" and/or compared to some other grouping system.
Obviously, the criteria by which the system is evaluated depend upon the
objectives of the system. That is, if the objective of the grouping system
is to find the most statistically pleasing groups (i.e., those groups that
minimize intra-group variation of the control parameter, in this case cost
per unit), then evaluation of the proposed system amounts to testing the
intra-group variation and comparing it to the variation arising from some other
system.^30 However, if the objective is to group hospitals according to variables
that in a smoothly functioning market would yield intra-group homogeneity of
cost per unit, the statistical evaluation of the system becomes almost impossible.
That is, the observation that intra-group variation is smaller in some other
grouping scheme cannot be taken as evidence that it is a "better" system given
this objective. If the market were functioning smoothly, no control program
would be necessary.
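
The purely statistical comparison described above (the first objective) amounts to comparing intra-group variation of cost per unit across grouping systems. A minimal sketch follows; the function name and the cost figures are invented for illustration, and the pooled sum of squared deviations is only one of several possible variation measures.

```python
from statistics import pvariance


def within_group_variation(costs_by_group):
    """Pooled within-group sum of squared deviations of cost per unit."""
    total = 0.0
    for costs in costs_by_group.values():
        # Population variance times group size = sum of squared deviations.
        total += pvariance(costs) * len(costs)
    return total


# Two hypothetical grouping systems applied to the same four hospitals.
system_a = {"g1": [100, 110], "g2": [200, 210]}
system_b = {"g1": [100, 200], "g2": [110, 210]}

variation_a = within_group_variation(system_a)
variation_b = within_group_variation(system_b)
print(variation_a, variation_b)
```

As the text cautions, a lower figure for one system is evidence of a "better" grouping only under the first objective; under the second objective it carries no such implication.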

However, since the evidence indicates that there are imperfections in the industry,
there can no longer be the expectation that hospitals grouped according
to the criteria developed in this paper will exhibit, at the start of the
program, similar levels of cost per unit. The theoretical implication is that
they should, and the purpose of the control program is to see that over time
they do. Thus, the grouping system developed here must be evaluated on the
basis of its conceptual strength and the translation of the conceptual framework
into an implementable program.


30

The best system by this definition includes cost per case unit as a
grouping variable.

CHAPTER THREE


CLASSIFICATION  METHODOLOGY 
I.  Introduction 

The system of prospective reimbursement discussed in the previous chapter consists
of three major elements: (1) selecting appropriate payment units (e.g., per
diem costs, per case costs, etc.) and reimbursable costs, (2) classifying
hospitals into groups homogeneous by external factors, and (3) establishing
reimbursement formulas for each identified hospital group on the basis of determined
payment units. The previous chapter addressed the first aspect and
identified the relevant economic factors which can be used to classify hospitals
(see Table 2.3); this chapter will describe the statistical methodology, generically
known as cluster analysis, which will arrange hospitals into groups such
that hospitals in the same group are more alike with respect to these factors
than hospitals in different groups.

The classification methodology consists of the four parts represented in Figure
3.1. The initial part requires determining which constraints, if any, will be
imposed a priori on the classification process; e.g., is there a minimum or
maximum number of groups required, are some hospitals not allowed in the same
group, is there a limit imposed on the number of singly grouped hospitals (isolates),
etc.?^31 Constraints which may be imposed in this part of the process
might be a function of the type of reimbursement system adopted. For example,
if rate setting formulas are based on the performance of hospitals in a group,
it may be imperative that each group contain a minimum number of hospitals in
order to have a significant sample. In addition, the number of isolates might
be constrained, for example, in order to limit the number of hospitals which
would have to be treated on an individual basis. Other constraints may be
dictated by practical considerations of administering the reimbursement system.

The  second  part  entails  calculation  of  the  similarity  measures;  i.e.,  the 
calculation  of  precise  quantities  measuring  homogeneity  between  all  pairs  of 
hospitals  based  on  the  relevant  economic  variables  previously  identified. 
The third part of this study utilizes the similarity measures to determine a
hierarchy of groups (called a dendrogram). This hierarchy displays the
progress of the cluster analysis algorithm as it proceeds in combining individual
hospitals into (ultimately) one group of all sampled hospitals.

The final step analyzes this hierarchy of hospitals to determine the best
hospital partition, the tradeoffs between the number of groups and total homogeneity,
and statistical validation of these groups using both parametric and
non-parametric procedures. Again, the type of reimbursement system adopted may
affect this process of analyzing the hierarchy of hospitals. For example, if
the economic framework requires that hospitals in different groups should have
different cost structures, then hospitals can be combined to the point where
the remaining groups all have significantly different cost structures.


31

For this study, no constraints were imposed other than restricting our
examination to short-term general hospitals, a function of the sample constituting
our data base.

FIGURE 3.1

Clustering Methodology


A.     Problem  Definition 

To examine the concept of a cluster hierarchy or dendrogram in more detail and
to define the clustering problem more precisely, a formal statement of the
problem can be given. Assume there exists a set of n hospitals $H = \{H_1, H_2, \ldots, H_n\}$,
where each hospital $H_i$ is described by a (p x 1) vector of variables (listed in
Table 2.3), $[y_{i1}, y_{i2}, \ldots, y_{ip}]$. For example, $y_{i1}$ might represent input
factor prices for the i-th hospital, $y_{i2}$ might represent the degree of unionization,
etc. Since no unique measurement is available for each variable, several
measures are used as surrogates (indicated in Table 2.4); thus, each variable
$y_{ij}$ for the i-th hospital is represented by a vector of measures or characteristics
with subscripts $ij1, ij2, \ldots, ijq_j$, where $q_j$ is the number of measures used for the j-th
variable.

Given these vectors of measures and weights corresponding to the variables,
$w_j$, we then wish to find a set $G = \{G_1, G_2, \ldots, G_k\}$ of k groups or clusters
such that hospitals in the same group are more "alike" than hospitals in different
groups with respect to the weighted measures, where the set G partitions
H; i.e.,

\[ G_q \subset H \quad \text{for } q \in Q = \{1, 2, \ldots, k\}, \]

\[ \bigcup_{q \in Q} G_q = H, \]

\[ \text{and} \quad G_q \cap G_s = \emptyset \quad \text{for all } q \neq s \in Q \]

(that is, all hospitals will be placed in a group and no hospital will be in
more than one group).
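
The three partition conditions can be checked mechanically for any proposed grouping. The sketch below is illustrative; the function name and the hospital labels are assumptions made for the example.

```python
def is_partition(groups, hospitals):
    """Return True if `groups` partitions the set `hospitals`."""
    hospitals = set(hospitals)
    seen = set()
    for group in groups:
        group = set(group)
        if not group <= hospitals:   # each G_q must be a subset of H
            return False
        if seen & group:             # groups must be pairwise disjoint
            return False
        seen |= group
    return seen == hospitals         # the union of all groups must equal H


H = {"H1", "H2", "H3", "H4", "H5"}   # five hypothetical hospitals
valid = is_partition([{"H1", "H2"}, {"H3"}, {"H4", "H5"}], H)
overlapping = is_partition([{"H1", "H2"}, {"H2", "H3"}, {"H4", "H5"}], H)
incomplete = is_partition([{"H1", "H2"}, {"H4", "H5"}], H)
print(valid, overlapping, incomplete)
```

Only the first grouping satisfies all three conditions: the second places H2 in two groups, and the third leaves H3 ungrouped.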

To examine the relationship between the level of homogeneity and the number
of groups (k), it is helpful to examine the results of a typical hierarchical
clustering scheme.^32 If the number of groups (k) is not specified beforehand,
a hierarchical scheme will generate a pattern of cluster combinations often
called a dendrogram. While specific measures are discussed below, it follows
from the definition of clustering itself that any monotonically increasing
measure of homogeneity within clusters ($\ell$) will decrease as more hospitals are
combined and a similarly defined measure of heterogeneity within clusters
(which we will denote V) will increase as combinations take place. An example
of such a dendrogram representing five hospitals is presented in Figure 3.2.
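
The way a hierarchical scheme links a chosen level to a number of groups can be sketched with a deliberately small example. The code below assumes single-linkage merging of five hospitals described by one made-up measure; the linkage rule, the data, and the function names are all assumptions for illustration, and the merge distance stands in for a heterogeneity measure that rises as combinations take place.

```python
def single_linkage_merges(points):
    """Agglomerate 1-D points; return a list of (merge_level, clusters_after)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        history.append((d, [sorted(c) for c in clusters]))
    return history


def groups_at(history, n_points, level):
    """Number of groups left after applying every merge at or below `level`."""
    return n_points - sum(1 for d, _ in history if d <= level)


hospitals = [10, 15, 40, 42, 90]     # one hypothetical measure per hospital
history = single_linkage_merges(hospitals)
levels = [d for d, _ in history]
print(levels)
print(groups_at(history, len(hospitals), 10))
```

Cutting this example at level 10 leaves three groups, and any cut between the second and third merge levels gives the same answer, which mirrors the idea that a fixed number of groups corresponds to a range of levels.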


32

For the purpose of this discussion, we will represent the level of
within group homogeneity as "$\ell$".

FIGURE 3.2

Dendrogram Definition

Thus, there exists a relationship between the level of within-group homogeneity
($\ell$) and the number of groups (k) for any dendrogram generated by a hierarchical
algorithm. If k is predetermined,

\[ \ell(k) = f(G) \]

and if $\ell$ is predetermined,

\[ k(\ell) = g(G). \]

For example, given a level of homogeneity $\ell$, a vertical line drawn through the
dendrogram at that point indicates three groups; conversely, assuming three
groups a priori indicates a range of homogeneity bounded by two such levels. Determining
these functions, $f(\cdot)$ and $g(\cdot)$, defines the task of the clustering methodology.

II. Measurement Selection, Weighting, and Similarity Measure Computation

As  previously  indicated,  only  those  variables  which  cause  "legitimate"  cost 
differences  should  be  used  for  determining  homogeneous  hospital  groups  in  order 
for  a  reimbursement  system,  based  on  those  groups,  to  be  effective.     Thus,  having 
identified  the  relevant  group  of  exogenously  determined  variables,  the  selection, 
weighting,  and  use  of  measurements  describing  those  variables  will  be  discussed 
here.^33

Determining the relative importance of grouping variables, as represented by
their respective weights, $w_j$, is a tenuous task at best. In general, weights
may be derived from two sources: (1) explicit weights added subjectively by
the analyst and (2) implicit weights determined by differences in measurement
scales, multicollinearities, etc. Most authors (Anderberg, 1973; Sokal, 1974)
argue that all implicit weights should be removed before the clustering process
begins; therefore, preclustering calculations will attempt to equalize the
relative importance of all selected characteristics to unit weight. Subsequently,
subjective weighting schemes can be tested for determining the ex post facto
sensitivity of resultant cluster definitions.
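
One common way to remove the implicit weights that arise from differences in measurement scale is to standardize each characteristic to zero mean and unit standard deviation. The sketch below assumes that approach; it is not necessarily the preclustering calculation used in this study, it does not address multicollinearity, and the data and function name are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column of `rows` to zero mean and unit standard deviation.

    Assumes every column varies across rows (a constant column would
    give a zero standard deviation and cannot be rescaled this way).
    """
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return [
        [(value - c) / s for value, c, s in zip(row, centers, spreads)]
        for row in rows
    ]


# Hypothetical measures for three hospitals: median family income in
# dollars and percentage of population aged 65 and over.
raw = [[8000.0, 10.0], [10000.0, 12.0], [12000.0, 14.0]]
scaled = standardize(raw)
print(scaled)
```

Before rescaling, the income column would dominate any distance calculation simply because it is measured in thousands; after rescaling, both columns carry equal implicit weight.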

A.     Hospital  Similarity  Measures  Defined 

The  clustering  process  is  most  often  initialized  by  calculating  a  matrix  of 
similarity  measures  between  all  pairs  of  hospitals  being  clustered,  where  the 
similarity  measure  between  any  two  hospitals  represents  a  composite  score  based 
on  the  selected  characteristics  or  measurements  describing  each  hospital. 
While an almost unlimited variety of similarity measures has been suggested
(for a complete discussion, see Sokal and Sneath, 1975), many similarity measures
can be dismissed on theoretical grounds. For example, a large number of measures
(known as coefficients of association) have been developed for dealing with
strictly dichotomous data. Other measures, based on the product-moment correlation
coefficient, present interpretation problems and are rarely used for
comparing objects.^34


33

As commonly used, these measurements or characteristics themselves may be
considered to be variables. In order to avoid confusion, however, we will
continue to use the term "variable" to describe the factors indicated by the main
headings of Table 2.3, and the interchangeable terms "measurement" or
"characteristic" to indicate those specific empirical quantities which represent
the variable in any particular instance.

34 

The exact meaning of the correlation coefficient between two hospitals
would be difficult to ascertain; the correlation coefficient implies that a
percentage of variation in one hospital's characteristics can be explained by
variation in another hospital's characteristics. If we are comparing two
hospitals' sizes, for example, it would assume that one hospital's size is some
function of the other's, a tenuous assumption at best. Furthermore, an absolute
comparison between variables cannot be made; a correlation measure gives a
relative comparison only. If two vectors are parallel, the correlation coefficient
will be |1|, even though the distance (and hence dissimilarity) between the vectors
(and hospitals) may be quite large. Furthermore, vectors do not need to be
parallel for the correlation coefficient to be equal to unity; as long as some
linear relationship exists between the two, the correlation coefficient will be
equal to |1|.

Given the continuous nature of the data set here, each hospital can be described
by a vector of m (where $m = \sum_{j=1}^{p} q_j$) identified characteristics and
represented by a single point in m-space, which corresponds to the respective
values for the m characteristics. Then, a similarity measure between any two
hospitals, say $H_r$ and $H_s$, can be defined as the distance D(r,s) between the
r-th and s-th points representing the respective hospitals. Distance functions
D(r,s) are not uniquely defined; however, if they meet the following three
criteria,

$$D(r,s) = 0 \quad \text{if and only if } H_r = H_s \tag{3.1}$$

$$D(r,s) = D(s,r) \tag{3.2}$$

and

$$D(r,s) \le D(r,t) + D(t,s) \tag{3.3}$$

then the distance function is said to be a metric.

The best known and most widely studied metrics are the Minkowski metrics. For
hospitals $H_r$ and $H_s$, each described by an (m x 1) vector of characteristics,

$$[x_{r11}, \ldots, x_{r1q_1}, x_{r21}, \ldots, x_{r2q_2}, \ldots, x_{rp1}, \ldots, x_{rpq_p}]$$

and

$$[x_{s11}, \ldots, x_{s1q_1}, x_{s21}, \ldots, x_{s2q_2}, \ldots, x_{sp1}, \ldots, x_{spq_p}],$$

respectively, the Minkowski metric is defined as follows:

$$D(r,s) = \left[ \sum_{j=1}^{p} \sum_{c=1}^{q_j} \left| x_{rjc} - x_{sjc} \right|^{P} \right]^{1/P} \quad \text{for } P \ge 1.$$

Obviously, there are any number of Minkowski metrics for values of P ≥ 1.
Differences between the metrics can be displayed by a graph showing unit
distance (the so called "unit ball") for any two points x and y; that is, all
points for which the distance from the origin is 1 (Figure 3.3).

The three metrics indicated by the solid lines (the $L_1$, $L_2$, and $L_\infty$ metrics)
have been subjected to the most study and examination; these are described below.

1.     The $L_1$ or "taxicab" or "Manhattan" metric can be stated for P = 1
as follows:

$$D(r,s) = \sum_{j=1}^{p} \sum_{c=1}^{q_j} \left| x_{rjc} - x_{sjc} \right|$$

2.     A second metric, letting P = 2, is probably the most familiar as the
measure of Euclidean distance,

$$D(r,s) = \left[ \sum_{j=1}^{p} \sum_{c=1}^{q_j} (x_{rjc} - x_{sjc})^2 \right]^{1/2} \tag{3.4}$$

3.     A third metric (letting $P \to \infty$) is sometimes referred to as the
Chebychev metric or uniform metric and can be stated as

$$D(r,s) = \max_{j,c} \left| x_{rjc} - x_{sjc} \right|$$

A variation of the Euclidean distance is calculated by squaring (3.4); this
squared distance is widely used, intuitively justified, and analogous to
the Mahalanobis distance (discussed later) and, therefore, will form the basic
similarity measure used in this study.

To illustrate this distance measure, assume that six hospitals are described
by two variables: median county family income and the number of quality
enhancing services. In this case, hospitals can be represented by six points
in two dimensional space as shown in Figure 3.4. Assuming that each variable
is equally weighted and using the squared Euclidean distance (i.e., P = 2) to
measure similarity, it is apparent that the hospitals represented by points
near each other are similar with respect to these two characteristics and,
hence, will have a smaller distance than dissimilar hospitals (for example,
hospitals $H_2$ and $H_3$ in Figure 3.4 are more alike than hospitals $H_2$ and $H_5$ since
$D^2(2,3) < D^2(2,5)$). In this example, the squared Euclidean distance between
hospitals $H_4$ and $H_6$ is easily calculated, using unit weights on each variable,
as follows:

$$\begin{aligned}
D^2(4,6) &= (I_4 - I_6)^2 + (Q_4 - Q_6)^2 \\
         &= (17,000 - 5,000)^2 + (5 - 3)^2 \\
         &= 144,000,004.
\end{aligned}$$
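
The three Minkowski special cases and the worked example can be checked numerically. The following is a minimal Python sketch, not part of the study's software: the function names `minkowski` and `squared_euclidean` are illustrative, and only the two hospitals from the worked example (income 17,000 with 5 services, income 5,000 with 3 services) are used.

```python
def minkowski(x, y, P):
    """Minkowski distance between two equal-length vectors, for P >= 1."""
    if P == float("inf"):
        # Chebychev (uniform) metric: largest single coordinate difference
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** P for a, b in zip(x, y)) ** (1.0 / P)


def squared_euclidean(x, y):
    """Squared Euclidean distance, i.e. equation (3.4) squared."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# (median county family income, number of quality enhancing services)
h4 = (17000, 5)
h6 = (5000, 3)

print(minkowski(h4, h6, 1))             # taxicab distance
print(minkowski(h4, h6, 2))             # Euclidean distance
print(minkowski(h4, h6, float("inf")))  # Chebychev distance
print(squared_euclidean(h4, h6))        # 144000004
```

Note that almost the entire distance comes from the income coordinate, whichever value of P is chosen.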


B.     Distance  Measures  Refined 

While distance measures are computationally and intuitively straightforward,
a number of problems may arise if distance measures are computed directly
from the data. One apparent problem is caused by differences in measurement
scales. In the example in Figure 3.4, one characteristic is measured in
terms of number of services provided while the other is measured in dollars,
hardly comparable measurement scales. The consequence, therefore, is that
the income variable dominates the calculation of $D^2(4,6)$, even though the
two variables are theoretically weighted equally.

FIGURE 3.4

Hospital Representation

To accommodate scale differences, a number of possibilities exist. The most
commonly utilized approach is to standardize the raw data (that is, subtract
the characteristic mean and divide by the characteristic standard deviation)
before computing distances between hospitals. Calculating the mean $\mu_{jc}$ and
unbiased standard deviation $\hat{s}_{jc}$ for the c-th measure of variable j,

$$\mu_{jc} = \sum_{i=1}^{n} \frac{x_{ijc}}{n}$$

$$\hat{s}_{jc}^{2} = \sum_{i=1}^{n} \frac{(x_{ijc} - \mu_{jc})^{2}}{n-1}$$

the squared Euclidean distance between the r-th and s-th hospitals can be defined
in terms of the standardized measure $z_{ijc}$ as follows:

$$D^2(r,s) = \sum_{j=1}^{p} \sum_{c=1}^{q_j} (z_{rjc} - z_{sjc})^2 \tag{3.5}$$

where

$$z_{rjc} = \frac{x_{rjc} - \mu_{jc}}{\hat{s}_{jc}} \tag{3.6}$$
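
The effect of standardization on the two-variable example can be sketched as follows. This is an illustration under stated assumptions: the six data values are hypothetical (only the fourth and sixth hospitals follow the worked example in the text), and `standardize` is an illustrative helper name, not a routine from the study.

```python
from statistics import mean, stdev


def standardize(column):
    """Return z-scores for one characteristic, as in equation (3.6)."""
    mu = mean(column)
    s = stdev(column)  # sample standard deviation (divisor n - 1)
    return [(x - mu) / s for x in column]


# Hypothetical data for six hospitals; only H4 and H6 follow the text.
income = [8000, 9000, 10000, 17000, 15000, 5000]
services = [2, 3, 4, 5, 6, 3]

z_income = standardize(income)
z_services = standardize(services)

# Squared distance between H4 and H6 (list positions 3 and 5)
raw_income_term = (income[3] - income[5]) ** 2
raw_total = raw_income_term + (services[3] - services[5]) ** 2
std_income_term = (z_income[3] - z_income[5]) ** 2
std_total = std_income_term + (z_services[3] - z_services[5]) ** 2

print("income share, raw data:    ", raw_income_term / raw_total)
print("income share, standardized:", std_income_term / std_total)
```

With the raw data the income term accounts for essentially all of the squared distance; after standardization both characteristics contribute on a comparable scale.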

Another significant although less obvious problem occurs when two or more
characteristics which describe a variable are correlated; metric distances
assume an orthogonal space. When spaces are not orthogonal, distances
calculated do not follow exactly from equation (3.5). In the example illustrated
in Figure 3.4, assume that a third measure, say retail wage rate, is included
as an additional measure of the income variable. In this case, it is most
likely that the retail wage rate and median family income are highly correlated,
and one underlying wage/income factor would explain the variance in both measures.
Then, since two wage/income measures are being used, the distances and
similarities between hospitals would be unduly weighted in that direction. Thus, it
is important to detect any multicollinearities between the measures in order
to remove any implicit weighting in the data.


The problem of correlated characteristics was resolved by principal components
analysis, a technique which extracts those key factors which are independent
of, or orthogonal to, each other 35 (in the above example, the two income measures
are combined into a single factor). If there were only one measurement for
each theoretically defined variable and we knew a priori the empirical
production function and demand curve facing the hospital, there would be no need to
extract orthogonal factors; economic theory would dictate the selection and
weighting of each variable and any multicollinearities would merely be
interesting artifacts of the data. However, given the imperfect data set used
in this study and the multiplicity of measures for several variables, it
became imperative to examine the measures using principal components analysis
in order to test the hypothesized relationship between the measures and their
respective variables. Given that relatively well defined factors were found
in this study, these factors were used to represent the variables and compute
the distance scores.


35

For a more complete description of principal components analysis, see
Harmon (1967).

An alternative procedure is based upon the use of the generalized Mahalanobis
distance $D^2(i,j)$ (Mahalanobis, 1936), which is defined between hospitals
r and s as follows:

$$D^2(r,s) = \Delta_x^T [S^2]^{-1} \Delta_x \tag{3.7}$$

where $S^2$ is the variance-covariance matrix and $\Delta_x$ is an (m x 1) vector of
characteristic differences,

$$\Delta_x = [(x_{r11} - x_{s11}), \ldots, (x_{rjc} - x_{sjc}), \ldots, (x_{rpq_p} - x_{spq_p})]^T$$

The Mahalanobis distance is appealing in that, given a completely orthogonal
space, the covariance matrix $S^2$ reduces to a diagonal matrix of characteristic
variances. In this case,

$$D^2(r,s) = \sum_{j=1}^{p} \sum_{c=1}^{q_j} (z_{rjc} - z_{sjc})^2$$

where $z_{rjc}$ is the standardized characteristic defined in (3.6). The Mahalanobis
distance then reduces to the standardized squared Euclidean distance defined
in (3.5), which was the similarity measure of choice in this study (note that
if the characteristics are standardized before computing $D^2$, the covariance
matrix $S^2$ reduces to the identity matrix and $D^2 = \Delta_x^T \Delta_x$). In general spaces,
where measurements may be correlated, multicollinearity effects are deleted
by the term $[S^2]^{-1}$ (in general, the greater the correlation between two
characteristics, the smaller the inverse covariance weighting). Thus, $D^2(r,s)$
is equivalent to standardizing the data, finding orthogonal factor scores from
a principal components analysis, and computing squared Euclidean distances.
Theoretically, the use of the Mahalanobis distance is advantageous due to its
ability to use one hundred percent of the variance; invariably, principal
components analysis results in some information reduction. Phillip and Iyer (1974)
used both principal components analysis and the Mahalanobis distance; after
utilizing principal components analysis to reduce the original set of
characteristics, the Mahalanobis distance was subsequently computed in place of the
metric distances previously suggested.
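
The stated equivalence between the Mahalanobis distance and squared Euclidean distances on orthogonal factor scores can be checked on a small case. The sketch below is an illustration only: it assumes two hypothetical, correlated measures (income and retail wage rate), retains both principal components, and scales each factor score to unit variance; none of the data or function names come from the study.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical, correlated measures for six hospitals.
income = [8000, 9000, 10000, 17000, 15000, 5000]
wage = [3.1, 3.4, 3.3, 5.0, 4.6, 2.5]
n = len(income)


def covariance(x, y):
    """Unbiased sample covariance (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def mahalanobis_sq(r, s):
    """Delta^T [S^2]^-1 Delta, with the 2 x 2 inverse written out."""
    a = covariance(income, income)
    b = covariance(income, wage)
    c = covariance(wage, wage)
    dx, dy = income[r] - income[s], wage[r] - wage[s]
    return (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / (a * c - b * b)


# Standardize, then form the principal component scores of the 2 x 2
# correlation matrix [[1, rho], [rho, 1]]. Its eigenvalues are 1 + rho and
# 1 - rho, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
zi = [(v - mean(income)) / stdev(income) for v in income]
zw = [(v - mean(wage)) / stdev(wage) for v in wage]
rho = covariance(zi, zw)
f1 = [(p + q) / sqrt(2 * (1 + rho)) for p, q in zip(zi, zw)]
f2 = [(p - q) / sqrt(2 * (1 - rho)) for p, q in zip(zi, zw)]


def factor_sq(r, s):
    """Squared Euclidean distance between unit-variance factor scores."""
    return (f1[r] - f1[s]) ** 2 + (f2[r] - f2[s]) ** 2


for r, s in [(0, 1), (3, 5), (2, 4)]:
    print(r + 1, s + 1, mahalanobis_sq(r, s), factor_sq(r, s))
```

The two columns agree to rounding error because every component is retained; dropping a component would make the factor-score distance an approximation of the Mahalanobis distance.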

In this study, both approaches were tried. However, when the Mahalanobis
distance was computed, it was found that the inclusion of all measurement
variables resulted in significant instabilities in the data. The problem is
caused by the fact that when covariance measures are linearly dependent, the
variance-covariance matrix in (3.7) has less than full rank and therefore
cannot be inverted. Since subsequent factor analysis showed strong linear
dependencies in the variance-covariance matrix, $[S^2]^{-1}$ was calculated but
resulted in meaningless values. Therefore, the approach used by Phillip and
Iyer (1974) for direct distance calculations from factor scores would have
had to be adopted. Similarity measures, however, were calculated from both
approaches (i.e., Mahalanobis distance and the squared Euclidean distance
calculated from factor scores) in order to verify the distances determined
(the two approaches resulted in values which were, in fact, exceedingly close).

C.     Similarity Measures - Summary

Measures were standardized and factor scores computed in order to remove the
effects of scale differences and multicollinearities among measures,
respectively. Since several measures were used to represent most key classification
variables, failure to compute orthogonal distances would have given greater
weight to the correlated measures.

Once orthogonal factors are found (and implicit weights removed), explicit
weights could be added to the factors to take account of the fact that the key
variables, now represented by the factors, do not have equal impact on hospital
cost structure. Two sets of weights were used for each approach. The first
set used unit weights to reflect equal weighting on all variables; the second
set of weights was determined from regression analysis, using cost per case
as the dependent variable. (The precise calculation of these latter weights
is described in the following chapter.)

In  all  cluster  analyses  performed  in  this  study,  similarity  measures  between 
hospital  pairs  were  computed  using  the  following  steps: 

1.  Find the mean and standard deviation for each characteristic and
compute standardized measures $z_{ijc}$ from (3.6).

2.  Factor analyze the standardized characteristics; reject any factors
with eigenvalues less than 1.0. 36

3.  Using  the  matrix  of  reduced  factor  score  coefficients  and  exogenously 
determined  factor  weights,  calculate  the  weighted  factor  scores  for 
each  hospital. 

4.  Using  the  weighted  factor  scores,  find  the  squared  Euclidean  distance 
between  all  pairs  of  hospitals  from  (3.5). 
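
The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not the SPSS procedure used in the study: the data are hypothetical, the factors are taken as principal components of the correlation matrix (found here with a simple Jacobi rotation routine), factor scores are scaled to unit variance, and unit weights are assumed unless others are supplied.

```python
from math import atan, cos, pi, sin, sqrt
from statistics import mean, stdev


def standardize(rows):
    """Step 1: z-scores per characteristic (column), divisor n - 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - mu) / sd for x, mu, sd in zip(row, mus, sds)] for row in rows]


def correlation(z):
    """Correlation matrix of the standardized characteristics."""
    n, m = len(z), len(z[0])
    return [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(m)]
            for a in range(m)]


def jacobi_eigen(sym, sweeps=50):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    m = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(m)] for i in range(m)]
    for _ in range(sweeps):
        for p in range(m - 1):
            for q in range(p + 1, m):
                if abs(a[p][q]) < 1e-14:
                    continue
                diff = a[q][q] - a[p][p]
                theta = pi / 4 if diff == 0 else 0.5 * atan(2 * a[p][q] / diff)
                c, s = cos(theta), sin(theta)
                for k in range(m):  # rotate columns p and q of A and V
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(m):  # rotate rows p and q of A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(m)], v


def weighted_factor_scores(z, weights=None, cutoff=1.0):
    """Steps 2 and 3: drop factors with eigenvalue < cutoff, then weight."""
    values, vectors = jacobi_eigen(correlation(z))
    keep = [k for k, lam in enumerate(values) if lam >= cutoff]
    weights = weights or [1.0] * len(keep)
    return [[w * sum(row[a] * vectors[a][k] for a in range(len(row))) / sqrt(values[k])
             for k, w in zip(keep, weights)] for row in z]


def squared_distances(scores):
    """Step 4: upper triangle of the squared Euclidean distance matrix."""
    n = len(scores)
    return {(r, s): sum((a - b) ** 2 for a, b in zip(scores[r], scores[s]))
            for r in range(n) for s in range(r + 1, n)}


# Hypothetical measures: income, retail wage rate, number of services.
hospitals = [(8000, 3.1, 2), (9000, 3.4, 3), (10000, 3.3, 4),
             (17000, 5.0, 5), (15000, 4.6, 6), (5000, 2.5, 3)]
scores = weighted_factor_scores(standardize(hospitals))
D2 = squared_distances(scores)
print(len(scores[0]), "factor(s) retained;", len(D2), "distances stored")
```

Because the three hypothetical measures are strongly correlated, only one factor passes the eigenvalue cutoff, and the 15 stored values correspond to the pairs (r, s) with r < s.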

These steps result in a square matrix of order n containing distance measures
representing similarities between all pairs of hospitals in orthogonal space.
Since the matrix is symmetrical by (3.2) and the diagonal elements are zero
by (3.1), only the n(n-1)/2 elements in the upper triangular part of the matrix
need to be calculated and retained. An example of such a similarity matrix,
which will be used as an illustration in the following sections, is shown in
Figure 3.5.


36

Eigenvalues may be loosely interpreted as a measure of explained variance;
an eigenvalue cutoff point of 1.0 (percent) was arbitrarily selected to agree
with the default value specified by the SPSS (Statistical Package for Social
Sciences) program.

FIGURE 3.5

Similarity Matrix

III.     Cluster Dendrogram Determination


Given the hospital representations and the matrix of similarity measures
described in the previous section, the next step of the clustering methodology
is hospital group determination. Even knowing these representations, however,
the problem of detecting homogeneous groups may still be a most problematic
one. For example, Figure 3.6 illustrates three cases which may occur:
(1) a number of distinct groups exist which can be delineated by linear
functions, (2) a number of distinct groups exist which cannot be delineated by
linear functions, and (3) no distinct groups appear evident other than the
single isolated hospital. While most classification problems encountered in
the real world fall into the third case, even problems in the first two cases
remain exceedingly difficult to resolve. To illustrate this difficulty,
examine the first case in Figure 3.4, where the number of hospitals (6) and
apparent groups (2) is limited. In this limited case, searching all possible
partitions using, say, linear discriminant analysis, would entail examining
31 possible partitions (enumerated in Table 3.1). 37 Not knowing the value of
k (the number of groups) beforehand, one would have to examine the sum of
a series of Stirling numbers of the second kind; in this example, k would vary
from 1 to 6 and a total of 203 possible partitions would have to be examined.
For this reason of combinatorial complexity, only heuristic algorithms will
be used in order to maintain computational feasibility. 38
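
The partition counts quoted above can be reproduced directly. The sketch below uses the standard closed form for Stirling numbers of the second kind; the function name `stirling2` is an illustrative choice.

```python
from math import comb, factorial


def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty groups."""
    total = sum((-1) ** (k - p) * comb(k, p) * p ** n for p in range(k + 1))
    return total // factorial(k)


print(stirling2(6, 2))                            # 31 two-group partitions
print(sum(stirling2(6, k) for k in range(1, 7)))  # 203 partitions in all
print(sum(stirling2(20, k) for k in range(1, 21)))  # growth for n = 20
```

The last line shows how quickly complete enumeration becomes impractical as the number of hospitals grows.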

Heuristic clustering strategies can be subdivided into hierarchical methods,
iterative methods, and ad hoc methods as represented in Figure 3.7.
Hierarchical methods either begin with all hospitals in individual groups and
subsequently combine these groups in some fashion (agglomerative clustering),
or initially place all hospitals in one group and subsequently split the
initial and following groups in such a way as to satisfy some criterion at each
stage (divisive clustering). Agglomerative clustering begins with n groups
and ends with one; divisive clustering starts with one group and ends with n.
Agglomerative methods (represented in Figure 3.2) can be further subdivided
on the basis of clustering criteria; linkage methods examine the total of
some within group measure. Iterative methods, on the other hand, normally
require assuming a priori a value of k (the number of groups) and begin with
some arbitrary partition of hospitals into k groups and proceed to rearrange
the hospitals in order to decrease the level of homogeneity among all groups.

37

In general, for n hospitals and k groups, the number of possible partitions
is equal to $S_n^{(k)}$, a Stirling number of the second kind, where

$$S_n^{(k)} = \frac{1}{k!} \sum_{p=0}^{k} (-1)^{k-p} \binom{k}{p} p^n$$

and $\binom{k}{p}$ is the number of combinations of k objects taken p at a time, equal to

$$\frac{k!}{p!\,(k-p)!}$$

38

While a number of optimization approaches to the clustering problem have
been proposed, none has been demonstrated to be computationally feasible for
anything other than trivial problems. For a description of several suggested
optimization algorithms, see Klastorin (1973).

FIGURE 3.6

Classification Illustrations

TABLE 3.1

Possible Partitions
(N = 6)

TWO GROUPS

        Group 1      Group 2

 1.     (1)          (2,3,4,5,6)
 2.     (2)          (1,3,4,5,6)
 3.     (3)          (1,2,4,5,6)
 4.     (4)          (1,2,3,5,6)
 5.     (5)          (1,2,3,4,6)
 6.     (6)          (1,2,3,4,5)
 7.     (1,2)        (3,4,5,6)
 8.     (1,3)        (2,4,5,6)
 9.     (1,4)        (2,3,5,6)
10.     (1,5)        (2,3,4,6)
11.     (1,6)        (2,3,4,5)
12.     (2,3)        (1,4,5,6)
13.     (2,4)        (1,3,5,6)
14.     (2,5)        (1,3,4,6)
15.     (2,6)        (1,3,4,5)
16.     (3,4)        (1,2,5,6)
17.     (3,5)        (1,2,4,6)
18.     (3,6)        (1,2,4,5)
19.     (4,5)        (1,2,3,6)
20.     (4,6)        (1,2,3,5)
21.     (5,6)        (1,2,3,4)
22.     (1,2,3)      (4,5,6)
23.     (1,2,4)      (3,5,6)
24.     (1,2,5)      (3,4,6)
25.     (1,2,6)      (3,4,5)
26.     (1,3,4)      (2,5,6)
27.     (1,3,5)      (2,4,6)
28.     (1,3,6)      (2,4,5)
29.     (1,4,5)      (2,3,6)
30.     (1,4,6)      (2,3,5)
31.     (1,5,6)      (2,3,4)

Given that divisive methods have some theoretical disadvantages which make
them less desirable than agglomerative methods (Gower, 1967), the use of such
algorithms was not considered. On the other hand, iterative algorithms would
be relatively undesirable for examining the tradeoffs between the number of
groups and total homogeneity. Another disadvantage encountered with iterative
methods is imposed by the nature of the iterative algorithms; most 39 are based
on simple pairwise exchanges from an initial (and often arbitrary) partition.
Recent evidence (Cormack, 1973) indicates that these procedures may frequently
terminate at poor solutions.

On  balance,  heuristic  agglomerative  algorithms  appear  most  appropriate  for 
use  in  an  examination  of  grouping  patterns.     In  addition,  it  would  be  desirable 
to  use  a  number  of  these  clustering  strategies  for  the  purpose  of  establishing 
consistency  requirements  and  validating  clusters  found.     In  other  words,  given 
the  heuristic  nature  of  the  algorithms  used,  additional  credibility  could  be 
attached  to  groups  simultaneously  identified  by  several  algorithms.  This 
concept  forms  the  basis  for  the  development  of  the  composite  dendrogram,  which 
will  be  explored  in  detail  in  a  later  section. 

A.     Agglomerative  Clustering 

We now turn to the examination of several algorithms which, based on the
similarity matrix S, calculate successively inclusive group structures. (Section
I briefly alluded to such agglomerative schemes by displaying a typical
resultant dendrogram.) Following this discussion, it will be shown how the results
from these algorithms can be used to compute a composite dendrogram.

A.1.     Linkage Methods

Linkage methods are both the simplest and the most widely used clustering methods.
Essentially, there are three categories of linkage methods (single linkage,
complete linkage, and average linkage) whose differences will be examined in
this section.

All linkage methods begin and proceed in a similar manner; the difference
between methods is the criterion used to merge groups of hospitals at given
stages. Given n hospitals and a symmetrical distance matrix |D(r,s)|, these
algorithms begin by assigning each hospital to a distinct group. Letting £
be a measure of homogeneity within groups and representing each of the n groups
by $\{H_i\}$, the development of a typical agglomerative algorithm can be followed by
examining the dendrogram in Figure 3.2. As £ is increased from zero (or some
small number), groups are combined if and only if their respective distances
are less than or equal to the value of £. In the procedures used in this study,
£ is defined by examining the minimum distance required for a given algorithm
to combine all hospitals into one group. This value is then divided into 25
equal intervals and £ is incremented 25 times; each increment level (labeled
from 1 to 25 on the computer output) is referred to as the cluster or class
level.


39

Included, for example, would be such well known algorithms as MacQueen's
"K-means" technique (MacQueen, 1967), and ISODATA (Ball and Hall, 1965).

A.1.a.     Complete Linkage

The complete linkage clustering criterion dictates that group $\{H_r\}$, for example,
would be combined with group $\{H_s, H_t\}$ if

$$\max \left[ D^2(r,s), D^2(r,t) \right] \le \pounds \tag{3.8}$$

for any level of homogeneity £. Equation (3.8) indicates that any two groups
are combined if and only if distances between all pairs of hospitals in the
two groups are less than or equal to £. The complete linkage algorithm results
in compact clusters; combining groups only when all links between hospitals
are less than the level of homogeneity £ is equivalent to minimizing the within
group diameter at each stage of the algorithm. In this case, each cluster can
be pictured as an m-dimensional sphere with the largest intra-cluster distance
as the diameter.

The complete linkage algorithm has been widely used and offers the distinct
advantage of being invariant to monotone transformations of the data. This
property is especially important in this study where credibility can be
established only on the ordinal ranking of some measurements. The complete linkage
algorithm can be illustrated by the similarity matrix in Figure 3.5 and the
dendrogram in Figure 3.8. In this case, a search of the n(n-1)/2 similarity
measures indicates that the minimum value is 12.5; thus, £ must be increased
to 12.5 before any of the hospitals (in this case, $H_2$ and $H_4$) are grouped
together at cluster level 1. Continuing the search, £ might be increased to 13.0
(the next smallest value), but no additional grouping would take place as the
pairwise distances (i.e., $D^2(2,3)$, $D^2(3,4)$) are not all less than or equal to 13.0
(here, $D^2(3,4)$ = 13.5). Thus, for another merger to take place, £ must be
increased to 13.5 and groups $\{H_2, H_4\}$ and $\{H_3\}$ will be merged as indicated at
level 2 in the dendrogram. The search continues in this manner; when £ = 16.0,
the groups $\{H_2, H_3, H_4\}$ and $\{H_1\}$ are combined as

$$\max \left[ D^2(1,2), D^2(1,3), D^2(1,4) \right] \le 16.0.$$

When £ = 18.0, groups $\{H_5\}$ and $\{H_6\}$ are combined (level 5); all hospitals are
not combined until £ ≥ 40.0 at level 9. (In the algorithms actually used, the
process would stop at level 9 when all hospitals are merged into one group.)
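
The merging sequence can be traced with a short agglomerative routine. The sketch below is an illustration under stated assumptions: Figure 3.5 is not reproduced here, so the squared distances are hypothetical values chosen only so that complete linkage yields the merge levels 12.5, 13.5, 16.0, 18.0 and 40.0 quoted in the text; the function name `agglomerate` is likewise illustrative. Merging the pair of groups with the smallest linkage value at each step gives the same hierarchy as raising the threshold £ gradually.

```python
from itertools import combinations

# Hypothetical squared distances for six hospitals (not the study's data).
D2 = {
    (1, 2): 15.0, (1, 3): 16.0, (1, 4): 14.0, (1, 5): 30.0, (1, 6): 40.0,
    (2, 3): 13.0, (2, 4): 12.5, (2, 5): 25.0, (2, 6): 35.0,
    (3, 4): 13.5, (3, 5): 28.0, (3, 6): 33.0,
    (4, 5): 22.0, (4, 6): 38.0,
    (5, 6): 18.0,
}


def agglomerate(d2, n, link=max):
    """Merge groups in order of increasing linkage level.

    link=max gives complete linkage; link=min gives single linkage.
    Returns a list of (level, merged group) pairs.
    """
    groups = [frozenset([i]) for i in range(1, n + 1)]
    history = []
    while len(groups) > 1:
        # Linkage value for every pair of current groups; pick the smallest
        level, a, b = min(
            ((link(d2[tuple(sorted((i, j)))] for i in a for j in b), a, b)
             for a, b in combinations(groups, 2)),
            key=lambda item: item[0],
        )
        groups = [g for g in groups if g not in (a, b)] + [a | b]
        history.append((level, sorted(a | b)))
    return history


for level, group in agglomerate(D2, 6, link=max):
    print("complete linkage:", level, group)
for level, group in agglomerate(D2, 6, link=min):
    print("single linkage:  ", level, group)
```

With these values the single linkage run merges the same groups at lower levels, since only the closest link between two groups has to fall under the threshold.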

A.1.b.  Single Linkage

If the clustering criterion (i.e., when groups $\{H_r\}$ and $\{H_s, H_t\}$ are combined)
is

$$\min \left[ D^2(r,s), D^2(r,t) \right] < \pounds,$$

the method is known as single linkage. The above equation indicates that two
groups will be combined as long as at least one link between hospitals in one
group and hospitals in the other group is less than £.

FIGURE 3.8

Example Dendrogram for Six Hospitals
(axis label: Expected Distinctiveness)

Single linkage methods often lead to a result known as chaining. For example,
if $D^2(r,s)$, $D^2(s,t)$, $D^2(t,u)$, and $D^2(u,v)$ are less than £, all five hospitals
would be combined in one group $\{H_r, H_s, H_t, H_u, H_v\}$ even though the distance
between hospital $H_r$ and hospital $H_v$ may be as large as 4£ (assuming the
similarity measure is a metric so that the triangle inequality holds). This cluster
is pictured in Figure 3.9.


The effect of chaining often leads to a criticism that single linkage methods,
while not necessarily resulting in compact clusters, do not yield sufficient
information about the structure of the cluster itself (Wishart, 1970). While
some controversy does exist with respect to the usefulness of these serpentine
clusters (Cormack, 1973), the decision was made to exclude the use of this
algorithm in this study. Given the simultaneous use of multiple algorithms
and the nature of the hospital data, it appears that a single linkage approach
would be inappropriate.

A.1.c.     Average Linkage

If two existing clusters are merged when the average distance measure within
a new cluster is less than £, the process is called average linkage.

A variation of this criterion has been suggested by Lance and Williams (1966),
who proposed to examine only links between the two candidate groups and merge
if the average distance is less than £.

Anderberg (1973) reports that the results obtained from these methods are,
not surprisingly, quite close and, in general, give results not too dissimilar
from those obtained by complete linkage methods.

A.2.     Centroid Methods

Centroid methods are similar to the concept of average linkage. Initially,
hospitals $\{H_r\}$ and $\{H_s\}$ are combined if $D^2(r,s)$ is the minimum of all
inter-hospital distances; the r-th and s-th columns (and rows) in the similarity
matrix are then replaced by one (average) vector. The process is then repeated
on the (n-1) x (n-1) similarity matrix; the two groups with the smallest
distance are joined. The distance between clustered groups at each stage provides
a measure of homogeneity £; however, centroid methods do not necessarily
guarantee that the distance function will monotonically decrease (the cluster
centroids may change or float with the merge of each new group). A primary
difference between centroid methods and linkage methods is that the former
methods describe clusters at any stage by the differences between a single
vector of scores (centroids), whereas the latter methods describe clusters
by the differences between the elements (i.e., hospitals) of the clusters.

Sokal and Sneath (1973) and Sokal and Michener (1958) were among the first
to describe centroid methods based on the arithmetic mean. Replacing several
hospitals in two groups by their single joint mean (i.e., centroid) has the
effect of weighting each prior group by the number of hospitals in that group.
While Sokal and Michener (1958) argue the plausibility of such weights,
Gower  (1967)  offers  a  scheme  utilizing  the  median  in  place  of  the  mean.  In 
the  latter  approach,   if  groups  G    and  G    are  used  to  represent  the  new 
vector  G  . 


A. 3.     Ad  Hoc  Methods 

There have been few optimization algorithms in hierarchical clustering.
Most optimization routines (most notably Ward, 1963; and Ward and Hook, 1963)
suboptimize in the sense of minimizing (or maximizing) some criterion at
each stage and thereby avoid determining a final, global optimum partition.
Mention of optimization desirability was made in 1958 by W. Fisher who, in
order to reduce the complexity of the problem, restricted himself to analyzing
single dimensional vectors (reducing the vector of characteristics to a single
measurement $x_i$ for the $i$th hospital). Assigning the $i$th hospital to the
$r$th group ($G_r$), Fisher defined homogeneity, synonymous with cluster
compactness, for the $r$th group as:

$$\ell_r = \sum_{i \in G_r} (x_i - \bar{G}_r)^2$$

(where $\bar{G}_r$ is the mean value for group $G_r$). The measure of homogeneity
expressed above is the sum of the squared Euclidean distances for all units in
the $r$th group from the mean $\bar{G}_r$, more often simply designated as the
"within group sum of squares." From $\ell_r$ several aggregate measures can be
derived, the most obvious being the total within group sum of squares,

$$\ell = \sum_{r=1}^{k} \ell_r$$

and, as a second alternative, the average total sum of squares, $\ell/k$.

An algorithm proposed by Ward (1963) generalized Fisher's problem to $m$
dimensions. While Ward's algorithm accommodates any functional, it is best known
by his example, minimizing the total sum of squared deviations about the
group mean.

Ward's measure of homogeneity ($\ell = \sum_{r=1}^{k} \ell_r$) differs from
Fisher's by varying the value of $k$ and thereby constituting a hierarchical
algorithm (Fisher assumed a fixed $k$). Ward's algorithm, using $\ell$ as a
stage-wise criterion, can be expressed as:

$$\text{find } \min \ell = \sum_{r=1}^{k} \ell_r \quad \text{for all } k = n, n-1, \ldots, 1.$$

Determining a global minimum for $\ell$ would require complete enumeration and
calculation of all possible values of $\ell$ at each stage. Ward's algorithm,
however, moves from the $k$th stage (i.e., $k$ clusters exist) to the $(k-1)$st
stage by examining only the $\binom{k}{2}$ possible pairwise combinations at
each stage, reducing the number of clusters at each stage by one. While this
procedure results in minimal $\ell$ at each stage, it does not guarantee a
global minimum. Ward's method, demonstrated here by the error sum of squares
criterion, may be calculated using various objective functions.
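The stage-wise character of this procedure can be sketched as follows. The code
is a minimal illustration under stated assumptions (a brute-force evaluation of
every pairwise merge, with hypothetical function names and data); it is not the
program used in the study.

```python
from itertools import combinations


def within_ss(group):
    """Within-group sum of squared deviations about the group mean."""
    dims = len(group[0])
    mean = [sum(p[d] for p in group) / len(group) for d in range(dims)]
    return sum((p[d] - mean[d]) ** 2 for p in group for d in range(dims))


def ward_stages(points):
    """Merge greedily from k = n groups down to k = 1, recording the total SS."""
    groups = [[p] for p in points]
    history = [(len(groups), 0.0)]
    while len(groups) > 1:
        best = None
        # Examine every pairwise merge and keep the one with the smallest total.
        for a, b in combinations(range(len(groups)), 2):
            rest = [g for i, g in enumerate(groups) if i not in (a, b)]
            trial = rest + [groups[a] + groups[b]]
            total = sum(within_ss(g) for g in trial)
            if best is None or total < best[0]:
                best = (total, trial)
        groups = best[1]
        history.append((len(groups), best[0]))
    return history


# Four hypothetical hospitals measured on a single characteristic.
stages = ward_stages([(0.0,), (1.0,), (5.0,), (6.0,)])
print(stages)
```

Each entry of `stages` pairs the number of groups $k$ with the total within
group sum of squares reached at that stage; only the best merge at each stage
is kept, so earlier merges are never revisited.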


B.     Composite  Dendrogram  Calculation 

Most studies using cluster analysis proceed by selecting one of the cluster
heuristics described in the previous section, with the algorithm selection often
based on computer code and/or programming availability (for example, the study
by Phillip and Iyer (1972) used the complete linkage agglomerative algorithm).
In some cases, results from more than one algorithm are evaluated and, on the
basis of some evaluative criteria, one set of results is accepted over another.

Given  the  combinatorial  nature  of  the  basic  clustering  problem,  the  heuristic 
nature  of  computationally  feasible  algorithms,  and  the  desire  to  minimize 
the  probability  of  incorrect  hospital  grouping,  it  was  decided  here  to  use  a 
number  of  algorithms  and  combine  their  results  in  a  composite  approach. 
Attempting  to  select  a  unique  resultant  dendrogram  based  on  some  evaluation 
criteria  did  not  prove  feasible  in  this  study  as  it  was  impossible  to  detect 
statistically  significant  differences  between  results  in  most  cases.     In  this 
instance,  a  statistically  superior  approach  would  be  provided  if  the  composite 
dendrogram  were  computed  such  that  the  probability  of  misgrouping  hospitals 
was  minimized.     While  the  likelihood  of  grouping  two  nonhomogeneous  hospitals 
together  would  be  reduced  by  this  conservative  approach,  it  might,  on  the 
other  hand,  result  in  an  increased  number  of  groups.     The  calculation  of  this 
composite  dendrogram  and  an  example  illustrating  its  use  are  presented  in 
the  following  sections. 


B.1.     Cophenetic Correlation Coefficient

One measure which may be used to evaluate resultant dendrograms, first
suggested by Sokal and Rohlf (1962), is referred to as the cophenetic
(i.e., class level) correlation coefficient. Measures of this type have also
been used by Boyce (1969) and Green and Carmone (1970). Since the hierarchical
programs described in the previous section divide the total measure of
homogeneity ($\ell$) in the resultant dendrogram into 25 distinct intervals or
class levels, the level where any pair of hospitals is joined can easily be
determined for any algorithm. Then, given the vector of joining class levels for
all pairs of hospitals, the cophenetic correlation coefficient can be defined as
the correlation coefficient between the vector of class levels and the vector of
similarity measures or distances between all hospital pairs.
For the six-hospital example in Figure 3.8, there are 15 hospital pairs,
which are enumerated in Table 3.2. For each hospital pair, one element in
the vector of pairwise joining levels is easily determined from the dendrogram
in Figure 3.8; i.e., the minimum value of $\ell$ (level) from 1 to 9 at which
the two hospitals are grouped together. For the same pairwise combination,
elements of another vector of distance scores are generated from Figure 3.5.
The cophenetic correlation coefficient is defined as the Pearson product
moment correlation between the two vectors.

Several problems, however, arise from the use of this approach. First,
pairwise joining levels are defined only for discrete ordinal values (in Table
3.2, values from 1 to 9) while the distance score vector is defined by
continuous interval quantities, thereby creating potential difficulties when
computing the correlation between the two vectors (see Feller, 1968). Second,
by using ordinal values of the joining level, which only approximate the exact
homogeneity level at which each hospital pair was joined, some information
is sacrificed. Lastly, use of the Pearson product moment correlation
coefficient assumes that the vector elements are drawn from multivariate normal
distributions.


TABLE 3.2

Cophenetic Correlation Coefficient: Absolute Validation

Hospital    Pairwise          Distance    Joining
Pair        Joining Levels    Scores      Distance Scores

(1,2)       4                 16.0        16.0
(1,3)       4                 15.0        16.0
(1,4)       4                 15.0        16.0
(1,5)       9                 35.0        40.0
(1,6)       9                 32.0        40.0
(2,3)       1                 12.5        12.5
(2,4)       2                 13.0        13.5
(2,5)       9                 38.0        40.0
(2,6)       9                 40.0        40.0
(3,4)       2                 13.5        13.5
(3,5)       9                 32.0        40.0
(3,6)       9                 32.0        40.0
(4,5)       9                 35.0        40.0
(4,6)       9                 34.0        40.0
(5,6)       5                 18.0        18.0
In general, the number of pairwise combinations is
$\binom{n}{2} = \dfrac{n!}{(n-2)!\,2!}$.
The first two problems can be resolved by using the exact value of $\ell$ at
which each hospital pair was joined (in lieu of the joining level) and then
comparing the vector of pairwise distance scores with this vector of joining
distance scores. While there remains a one-to-one correspondence between the
joining level and joining distance score vectors in Table 3.2, the vectors being
correlated now both consist of interval quantities. The last problem (i.e., the
normality assumption) is easily resolved by using the nonparametric Spearman
correlation coefficient. This measure will hereafter be referred to as the
cophenetic correlation coefficient.
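As an illustration, the revised coefficient can be computed for the two
interval-valued columns of Table 3.2. The sketch below assumes that tied values
receive the average of their ranks and that the Spearman coefficient is taken as
the Pearson correlation of those ranks; the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product moment correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Distance scores and joining distance scores for the 15 pairs in Table 3.2.
distance = [16.0, 15.0, 15.0, 35.0, 32.0, 12.5, 13.0, 38.0,
            40.0, 13.5, 32.0, 32.0, 35.0, 34.0, 18.0]
joining = [16.0, 16.0, 16.0, 40.0, 40.0, 12.5, 13.5, 40.0,
           40.0, 13.5, 40.0, 40.0, 40.0, 40.0, 18.0]
r = spearman(distance, joining)
print(round(r, 3), round(r * r, 3))   # coefficient and its square
```

The squared value is the quantity that serves as the weight for an algorithm
when several dendrograms are combined.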

B.2.     Composite  Dendrogram  Calculation 

Using  several  heuristic  algorithms  and  the  square  of  their  respective  cophenetic 
correlation  coefficients  as  measures  of  validity,  the  results  can  be  combined 
into  a  single  composite  dendrogram.     Basically,  the  approach  here  is  to  group 
hospitals  together  only  when  a  "weighted  majority"  of  the  algorithms  agree 
that  such  hospitals  are  in  fact  homogeneous.     Such  an  approach  has  the  effect 
of  minimizing  the  probability  that  dissimilar  hospitals  will  be  grouped  together, 
while  allowing  the  possibility  that  additional  groups  will  be  created.  Six 
heuristic  cluster  algorithms  were  used  to  form  a  complete  composite  dendrogram; 
these  algorithms  (described  in  the  previous  section)  included  the  following: 

1.  complete linkage,

2.  average linkage between merged groups,

3.  centroid method,

4.  median method of Gower,

5.  average linkage within groups,

6.  Ward's suboptimization method.

To find the composite dendrogram, let $A$ be the set of individual algorithms
used and $r_\delta^2$ (for $\delta \in A$) equal the squared cophenetic
correlation coefficient for each respective algorithm (where
$0 \le r_\delta^2 \le 1$). In order to vary the sensitivity of the composite
dendrogram to differences among the separate cluster algorithms, a single
coefficient $\beta$ is defined, where the range of $\beta$ is defined over the
discrete interval of class levels (in our study, from 1 to 25). Basically,
$\beta$ equals the number of recognized class levels in each dendrogram; for
example, if $\beta$ equals 1 (the least sensitive position) then all levels are
considered together and all hospitals are grouped at level 1. If $\beta$ equals
25 (the most sensitive position), then each joining class is recognized; if
$\beta$ equals 12.5 then class levels 1 and 2 are equated, class levels 3 and 4
are equated, etc. $\beta$ is also used to define $n_{ij\delta}$, the number of
recognized class levels at which the $i$th and $j$th hospitals were joined by
algorithm $\delta$. For example, if $\beta$ equals 25, then $n_{ij\delta}$
simply equals the class level at which the $i$th and $j$th hospitals join
together by algorithm $\delta$; if $\beta$ equals 12.5, then $n_{ij\delta}$
equals 1 if hospitals $i$ and $j$ are joined at either class level 1 or 2, etc.

Given $r_\delta^2$, $\beta$, and $n_{ij\delta}$, a measure $d_{ij}$ can be found
which expresses the "distance" or similarity between all pairs of hospitals
based on joining class levels for all hospital pairs from individual
dendrograms, weighted by their respective $r_\delta^2$. Such a measure is
defined as follows:

$$d_{ij} = 1 - \frac{\sum_{\delta \in A} \left(1 - \dfrac{n_{ij\delta}}{\beta}\right) r_\delta^2}{\sum_{\delta \in A} r_\delta^2}. \qquad (3.9)$$

The range of $d_{ij}$ varies from $1/\beta$ to 1 (for all $r_\delta^2 = 1$); the
sensitivity index $\beta$ is easily visualized as the determinant of the range
of $d_{ij}$ (which increases with the value of $\beta$). Thus, a larger value of
$\beta$ results in a larger range of $d_{ij}$, more discrimination between
hospitals, and more sensitivity in the composite dendrogram. $\beta$ is a useful
concept in that many of the groups formed at the initial class levels are too
small and numerous to be of much interest. Thus, setting a smaller value of
$\beta$ (say $\beta = 8$) eliminates examination of these structures and
concentrates upon dominant structures or partitions. Note that $d_{ij}$ is
indeterminate if all $r_\delta^2$ are equal to zero.

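Examining the computation of $d_{ij}$ further, it becomes apparent that

$$d_{ij} = 1 - \frac{\sum_{\delta \in A} r_\delta^2 - \dfrac{1}{\beta} \sum_{\delta \in A} n_{ij\delta}\, r_\delta^2}{\sum_{\delta \in A} r_\delta^2}
= 1 - 1 + \frac{\dfrac{1}{\beta} \sum_{\delta \in A} n_{ij\delta}\, r_\delta^2}{\sum_{\delta \in A} r_\delta^2}
= \frac{1}{\beta \sum_{\delta \in A} r_\delta^2} \sum_{\delta \in A} n_{ij\delta}\, r_\delta^2.$$

Since $\dfrac{1}{\beta \sum_{\delta \in A} r_\delta^2}$ is constant, the term can
be ignored and $d_{ij}$ defined simply as

$$d_{ij} = \sum_{\delta \in A} n_{ij\delta}\, r_\delta^2. \qquad (3.10)$$

In (3.10), $d_{ij}$ now ranges from $\eta$ to $\eta\beta$ (where $\eta$ is the
number of algorithms) for $r_\delta^2 = 1$; thus, (3.10) is identical to
changing the scale of (3.9) by a constant factor of $\eta\beta$.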
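A short sketch may help fix the roles of $\beta$, $n_{ij\delta}$ and
$r_\delta^2$. The rule used below to collapse original class levels onto
recognized levels (dividing by $25/\beta$ and rounding up) is an assumption that
reproduces the examples given for $\beta$ = 25, 12.5 and 1; the function names
are hypothetical.

```python
import math


def recognized_level(level, beta, max_levels=25):
    """Collapse an original class level (1..max_levels) onto 1..ceil(beta)."""
    width = max_levels / beta          # original levels per recognized level
    return math.ceil(level / width)


def composite_distance(levels, r_squared, beta, max_levels=25, normalized=False):
    """d_ij for one hospital pair: (3.10), or (3.9) when normalized=True.

    levels[k] is the class level at which algorithm k joined the pair and
    r_squared[k] is that algorithm's squared cophenetic correlation.
    """
    n = [recognized_level(lv, beta, max_levels) for lv in levels]
    if not normalized:
        return sum(n_d * r2 for n_d, r2 in zip(n, r_squared))
    total = sum(r_squared)
    if total == 0:
        raise ValueError("d_ij is indeterminate when every r^2 is zero")
    penalty = sum((1.0 - n_d / beta) * r2 for n_d, r2 in zip(n, r_squared))
    return 1.0 - penalty / total


# Hypothetical pair joined at level 3 by one algorithm and level 4 by another.
levels, weights = [3, 4], [1.0, 1.0]
d_full = composite_distance(levels, weights, beta=25)
d_coarse = composite_distance(levels, weights, beta=12.5)
d_norm = composite_distance(levels, weights, beta=25, normalized=True)
print(d_full, d_coarse, d_norm)
```

For this pair the unnormalized value at $\beta = 25$ is the normalized value
multiplied by $\eta\beta = 50$, which is the change of scale described above.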

To illustrate the calculation of the composite dendrogram, assume that two
algorithms and their respective dendrograms (pictured in Figure 3.10) are to
be combined, where $\eta = 2$, $n = 5$ hospitals, and cophenetic correlation
coefficients for both dendrograms ($r_\delta^2$) are equal to 1.0. For
$\beta = 4$ (most sensitive value) and $\beta = 2$, the matrices of similarity
measures are shown in Table 3.3; Figure 3.10 shows the resultant composite
dendrograms. As evident from an examination of the resultant dendrograms, a
value of $\beta = 4$ results in some discrimination between the grouping of
hospitals 1, 2, and 3, while a value of $\beta = 2$ shows no discrimination.
FIGURE 3.10

Composite Dendrogram: Example

[Dendrograms plotted against class level (LEVEL), each with $r_\delta^2 = 1.0$.]
TABLE 3.3

Similarity Measures $d_{ij}$
(Example: Figure 3.10)

$\beta = 4$

3    5    8
3    5    8
-    5    8
8

$\beta = 2$

3    3    4
3    3    4
-    3    4
4

Note that the procedure for combining dendrograms essentially defines a complete
linkage algorithm using $d_{ij}$ as similarity measures. Thus, once $d_{ij}$
values are calculated, the composite dendrogram requires no new computer
algorithm.

In practice it was found that values of $\beta$ equal to 25, 12.5, or 8 resulted
in the most useful structures. However, given the development of measures to
detect and evaluate dominant partitions within a dendrogram (described in the
following section), it became unnecessary to examine various values of $\beta$.
Thus, due to the desire to minimize the probability of clustering nonhomogeneous
hospitals, the value of $\beta$ was set to 25 for purposes of this study, and
only the most sensitive composite dendrograms were used.

IV.     Partition  Evaluation 

Having determined the composite dendrogram, the problem still remains of
evaluating the dendrogram and finding which partition offers the best tradeoff
between total homogeneity and number of groups. Note that in any dendrogram
(Figure 3.8 for example), a partition can be formed by placing a vertical line
through any class level. Thus, the initial question is how to objectively
evaluate partitions formed (by vertical lines) at each class level, without
having to rely upon visual examination.

A.     Expected  Distinctiveness  Defined 

Examining a dendrogram carefully, it becomes evident that each horizontal line
segment defines a separate group of hospitals (in Figure 3.8, there are 11
different groups labeled $G_1$ through $G_{11}$). Each group's distinction or
importance, in a sense, is defined by the length of its respective line segment
as measured by the number of class levels. For example, a group that exists for
only one class level is less distinct than a group that exists for four class
levels. Given that a probability is a number from 0 to 1 which is the limit of
relative frequencies, the probability of any hospital $H_i$ belonging to group
$G_q$ can be defined using this concept of distinctiveness.

Therefore, letting

$$s_{iq} = \begin{cases} 1 & \text{if hospital } H_i \text{ belongs to group } G_q \text{ (i.e., if } H_i \in G_q\text{)}, \\ 0 & \text{otherwise,} \end{cases}$$

the probability that any hospital $H_i$ belongs to group $G_q$ can be defined as

$$P(s_{iq}) = P(H_i \in G_q) = \lim_{D \to \infty} \frac{(\text{distance } H_i \text{ belongs to } G_q)}{D} = \lim_{D \to \infty} \frac{d(G_q)\, s_{iq}}{D},$$

where

    $d(G_q)$ = number of class levels for which group $G_q$ exists

and

    $D$ = maximum number of class levels.


In the example in Figure 3.8, $D = 10$ and $P(H \in G) = 0.04$,
$P(H_3 \in G_8) = 0.2$, and $P(H_2 \in G_{10}) = 0$. Therefore, for a given
group $G_q$, a measure of expected distinctiveness can be defined as

$$E(G_q) = \sum_{i=1}^{n} s_{iq}\, P(H_i \in G_q) = \sum_{i=1}^{n} s_{iq}\, P(s_{iq}) = \sum_{i=1}^{n} s_{iq} \left[ \frac{d(G_q)\, s_{iq}}{D} \right].$$
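For a given partition $P_\ell$, defined by a vertical line through class level
$\ell$, the expected level of distinctiveness $E(P_\ell)$ for this partition is
then defined as

$$E(P_\ell) = \sum_{q \in P_\ell} \sum_{i=1}^{n} s_{iq} \left[ \frac{d(G_q)\, s_{iq}}{D} \right] = \frac{1}{D} \sum_{q \in P_\ell} d(G_q) \sum_{i=1}^{n} (s_{iq})^2. \qquad (3.11)$$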


Since $s_{iq}$ is a (0,1) variable, $s_{iq} = (s_{iq})^2$, and
$\sum_{i=1}^{n} (s_{iq})^2$ simply equals the number of hospitals in group
$G_q$. Thus, the measure of expected distinctiveness for any given partition
$P_\ell$, (3.11), is simply a constant term $1/D$ times the sum of the number of
hospitals in each group in partition $P_\ell$ times the distinctiveness (in
terms of class levels) of that group. Since the term $1/D$ has no effect ($D$ is
constant for our computer programs), the term may be dropped and the expected
distinctiveness $E(P)$ for partition $P$ hereafter defined as

$$E(P) = \sum_{q \in P} d(G_q) \sum_{i=1}^{n} (s_{iq})^2. \qquad (3.12)$$


The expected distinctiveness values for each group are shown in Figure 3.8
(overlay #2); summing these values for all groups defined for each partition
gives an objective measure for evaluating the partition formed at each class
level. As apparent from Figure 3.8, the partition formed at the fourth class
level in this example appears to be optimal with $E(P_4) = 30$.
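The computation in (3.12) can be sketched for a small dendrogram. The group
memberships and durations below are assumptions, loosely modeled on the joining
levels of Table 3.2 with $D = 10$; they are not claimed to reproduce Figure 3.8
exactly, and the names are hypothetical.

```python
# Each group: (member hospitals, d(G_q) = class levels for which it exists).
GROUPS = {
    "G1": ({1}, 4), "G2": ({2}, 1), "G3": ({3}, 1), "G4": ({4}, 2),
    "G5": ({5}, 5), "G6": ({6}, 5), "G7": ({2, 3}, 1), "G8": ({2, 3, 4}, 2),
    "G9": ({1, 2, 3, 4}, 5), "G10": ({5, 6}, 4),
    "G11": ({1, 2, 3, 4, 5, 6}, 1),
}
HOSPITALS = {1, 2, 3, 4, 5, 6}


def group_weight(name):
    """d(G_q) times the number of hospitals in G_q, as in (3.12)."""
    members, levels = GROUPS[name]
    return levels * len(members)


def expected_distinctiveness(partition):
    """E(P): sum of group weights; the groups must partition the hospitals."""
    covered = [h for name in partition for h in GROUPS[name][0]]
    if sorted(covered) != sorted(HOSPITALS):
        raise ValueError("groups do not partition the hospital set")
    return sum(group_weight(name) for name in partition)


level_4 = expected_distinctiveness(["G9", "G5", "G6"])   # cut at class level 4
root_only = expected_distinctiveness(["G11"])            # all hospitals together
print(level_4, root_only)
```

Under these assumed durations the cut at the fourth class level sums to 30, in
line with the value quoted above, while grouping all six hospitals together
scores only 6.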

B.     Optimal  Partition  Determination 

Given that any partition $P_\ell$ can be evaluated by its expected
distinctiveness $E(P_\ell)$, the problem now is to find the partition $P^*$
which maximizes $E(P)$. The problem is complicated by the fact that the optimal
partition is not necessarily defined by a vertical line through any class level;
thus, evaluating $E(P_\ell)$ for all values of $\ell = 1, 2, \ldots, 25$ will
not guarantee identification of the optimal partition.


In order to find the optimal partition, as well as suboptimal partitions and
the tradeoffs between partitions and expected distinctiveness, define decision
variables $y_q$ as follows:

$$y_q = \begin{cases} 1 & \text{if group } G_q \text{ is included in the optimal partition } P^*, \\ 0 & \text{otherwise.} \end{cases}$$

Then, letting $c_q = d(G_q) \sum_{i=1}^{n} s_{iq}$, the problem of finding the
optimal partition $P^*$ can be formulated as an integer linear programming
problem below:
$$\text{(SPP)} \qquad \text{Maximize } E(P^*) = \sum_{q=1}^{M} c_q\, y_q$$

$$\text{S.T.} \qquad \sum_{q=1}^{M} s_{iq}\, y_q = 1 \qquad \text{for all } i = 1, 2, \ldots, n$$

$$y_q = 0, 1 \qquad \text{for all } q = 1, 2, \ldots, M,$$

where

    $M$ = total number of identified groups.


Using the example shown in Figure 3.8 to illustrate the problem, where the $c_q$
values are indicated in the second overlay and $M = 11$ groups, the problem
of finding $P^*$ is stated as follows:

$$\text{Maximize } E(P) = 4y_1 + y_2 + y_3 + 2y_4 + 5y_5 + 5y_6 + 2y_7 + 6y_8 + 20y_9 + 9y_{10} + 6y_{11}$$

$$\begin{aligned}
\text{S.T.} \qquad y_1 + y_9 + y_{11} &= 1 \\
y_2 + y_7 + y_8 + y_9 + y_{11} &= 1 \\
y_3 + y_7 + y_8 + y_9 + y_{11} &= 1 \\
y_4 + y_8 + y_9 + y_{11} &= 1 \\
y_5 + y_{10} + y_{11} &= 1 \\
y_6 + y_{10} + y_{11} &= 1 \\
y_q &= 0, 1
\end{aligned}$$

Note that the term $1/D$ was dropped as this constant term would not affect the
optimal solution determination in any way.


Problem (SPP) is a well known and widely studied combinatorial problem referred
to as the set partitioning problem (see Garfinkel and Nemhauser, 1972).
Theoretically, problem (SPP) could be solved by a general purpose integer linear
programming (IP) algorithm; however, even state-of-the-art IP algorithms would
severely limit the number of variables and hence the problem size which could be
accommodated. While a number of specialized algorithms have been proposed for
the set partitioning problem, it remains, in general, a difficult problem to
solve. However, the presence of a single property in this case, based on the
nature of a dendrogram, suggests a solution approach which is easily implemented
and guarantees optimality.
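For a problem as small as the six-hospital example, (SPP) can simply be
enumerated, which makes the formulation concrete. In the sketch below the
weights are the objective coefficients quoted for the example; the group
memberships are an assumption inferred from the joining levels in Table 3.2, and
exhaustive enumeration stands in for an integer programming code.

```python
from itertools import product

# Assumed membership of each group G_q (hospitals 1..6).
MEMBERS = {
    1: {1}, 2: {2}, 3: {3}, 4: {4}, 5: {5}, 6: {6},
    7: {2, 3}, 8: {2, 3, 4}, 9: {1, 2, 3, 4}, 10: {5, 6},
    11: {1, 2, 3, 4, 5, 6},
}
# Objective coefficients c_q as quoted for the example.
WEIGHTS = {1: 4, 2: 1, 3: 1, 4: 2, 5: 5, 6: 5, 7: 2, 8: 6, 9: 20, 10: 9, 11: 6}


def solve_spp(members, weights):
    """Enumerate all 0-1 vectors y and keep the best feasible partition."""
    hospitals = sorted(set().union(*members.values()))
    names = sorted(members)
    best_value, best_choice = None, None
    for y in product((0, 1), repeat=len(names)):
        chosen = [q for q, flag in zip(names, y) if flag]
        # Each hospital must be covered by exactly one chosen group.
        if all(sum(h in members[q] for q in chosen) == 1 for h in hospitals):
            value = sum(weights[q] for q in chosen)
            if best_value is None or value > best_value:
                best_value, best_choice = value, chosen
    return best_value, best_choice


value, partition = solve_spp(MEMBERS, WEIGHTS)
print(value, partition)
```

With $M$ groups the loop visits $2^M$ candidate vectors, which is why
enumeration is only workable for toy cases and a method that exploits the tree
structure is needed in practice.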
The solution procedure is based upon the observation that a dendrogram is
simply a specific type of graph called a tree, where the vertical line segments
and terminal points in a dendrogram can be considered nodes ($G_q$) and
horizontal line segments can be considered directed arcs with weights $c_q$
(Figure 3.8). A tree then is a finite graph without a cycle and with at least
two vertices. If every node except one (the root) of a tree is the terminal node
of exactly one arc, then the tree is said to be an arborescence with that node
as its root.

The algorithm begins at the root of the arborescence and searches through
lower cluster levels; the solution procedure is based upon the observation that
the branches of the tree automatically partition set $H$. Letting $I_j$ be the
set of arc indices branching from node $G_j$, problem (SPP) becomes one of
finding sets $I_j$ such that $\sum_{j=1}^{J} \sum_{q \in I_j} c_q$ is maximized
(where $c_q$, the expected distinctiveness of group $G_q$, represents the weight
on the arc terminating at node $G_q$). The nature of the dendrogram guarantees
that the constraints are met (i.e., that
$\bigcup_{j=1}^{J} \bigcup_{q \in I_j} G_q$ partitions $H$); the objective
function allows the problem to be decomposed into $J$ subproblems.


The algorithm begins at the root node and searches the nodes branching from it;
each of these nodes is subsequently examined until the algorithm reaches the
terminal nodes of the tree (i.e., hospitals grouped individually) or until
certain conditions are met. Assuming that the search procedure is at node $G_j$,
where $G_j$ is not a terminal node of the tree, the following three rules
establish conditions for continuing or stopping the search procedure along any
branch, and guarantee that a global optimum will be found.


Rule 1:  Find the set of arcs $I_j$ branching from node $G_j$ and calculate
         $\sum_{q \in I_j} c_q$. If $\sum_{q \in I_j} c_q > c_j$ (where $c_j$
         represents the weight on the arc terminating at node $G_j$), then the
         search from all nodes $G_r$ ($r \in I_j$) must be continued and group
         $G_j$ eliminated from consideration.

         Justification: If $\sum_{q \in I_j} c_q > c_j$, the weights of the
         branches emanating from $G_j$ are greater than the weight on branch
         $c_j$. Since problem (SPP) can be decomposed into independent
         subproblems, only the set of nodes $I_j$ need be considered.


Rule 2:  Calculate $\max_j = \ell(G_j) \sum_{i=1}^{n} s_{ij}$ at node $G_j$,
         where $\ell(G_j)$ is the class level corresponding to node $G_j$. If
         $c_j \ge \max_j$, then search is halted from node $G_j$ and
         $G_j \in P^*$.

         Justification: Since $\sum_{i=1}^{n} s_{ij}$ = number of hospitals in
         group $G_j$, the maximum possible expected distinctiveness value from
         node $G_j$ would be defined by (3.11) if $d(G_j) = \ell(G_j)$. Thus, if
         $c_j \ge \max_j$, no improvement is possible, any search from node
         $G_j$ is unnecessary, and $G_j \in P^*$.


Rule 3:  At node $G_j$, find $\min_j$ (the minimum distinctiveness value
         possible from node $G_j$). If $\min_j > c_j$, then the search must
         continue and $G_j \notin P^*$.

         Justification: Since $\min_j$ represents the worst possible case
         from node $G_j$, if $\min_j > c_j$, then continuing the search must
         improve the total distinctiveness value.
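The decomposition that underlies these rules can also be written as a plain
recursion: at every node, either the group itself is kept or the best partitions
of its branches are combined, whichever is worth more. The sketch below is that
generic recursion, not the three-rule search with its early stopping bounds; the
tree shape is an assumption inferred from Table 3.2, and the weights are the
objective coefficients quoted for the example.

```python
def best_partition(node, weights, children):
    """Return (value, groups) maximizing total weight at or below `node`."""
    kids = children.get(node, [])
    if not kids:
        return weights[node], [node]       # terminal node: a single hospital
    split_value, split_groups = 0, []
    for kid in kids:
        value, chosen = best_partition(kid, weights, children)
        split_value += value
        split_groups += chosen
    # Keep the group itself unless splitting it is strictly better.
    if weights[node] >= split_value:
        return weights[node], [node]
    return split_value, split_groups


# Assumed arborescence for the six-hospital example, rooted at G11.
CHILDREN = {
    "G11": ["G9", "G10"], "G9": ["G1", "G8"], "G8": ["G4", "G7"],
    "G7": ["G2", "G3"], "G10": ["G5", "G6"],
}
WEIGHTS = {"G1": 4, "G2": 1, "G3": 1, "G4": 2, "G5": 5, "G6": 5,
           "G7": 2, "G8": 6, "G9": 20, "G10": 9, "G11": 6}

value, groups = best_partition("G11", WEIGHTS, CHILDREN)
print(value, groups)
```

Each node is visited once, so the work grows with the number of groups rather
than with the number of candidate partitions; Rules 2 and 3 go further by
allowing whole branches to be settled without being searched.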

The calculation of $\min_j$ is based upon the class level corresponding to node
$G_j$ ($\ell(G_j)$), the maximum number of subsequent partitions ($K_j$)
possible from node $G_j$, and the number of hospitals ($n_j$) in group $G_j$
(where $n_j = \sum_{i=1}^{n} s_{ij}$). The value of $K_j$ is based upon finding
the maximum number of possible groups from node $G_j$, which occurs if all
subsequent groups are formed by pairwise combinations. Given that there are, on
the average, $n_j / 2^k$ groups at the $k$th class joining, the maximum number
of groups is defined as follows,

$$\sum_{k=0}^{K_j} \frac{n_j}{2^k} = n_j \left[ \frac{2^{K_j+1} - 1}{2^{K_j}} \right] = 2n_j - \frac{n_j}{2^{K_j}} = 2n_j - 1,$$

where $K_j$ is the largest value such that $n_j / 2^{K_j} \ge 1$ (the last
equality taking $n_j / 2^{K_j} = 1$). (This problem is equivalent to the problem
of finding the number of nodes in a binary tree.) Given the $(2n_j - 1)$ groups,
the problem of finding $\min_j$ is stated as follows:


$$\text{Minimize} \quad \max_{i = 1, \ldots, K_j} \; \sum_{h \in P_{ij}} E(G_h) \qquad (3.13)$$

$$E(G_h) = n_h\, d(G_h)$$

$$E(G_h) \ge 0 \quad \text{for all } h$$

where a dendrogram of all pairwise combinations from node $G_j$ defines
partitions $P_{ij}$. Since the number of hospitals in each group $G_h$ can be
easily determined, the problem of defining $E(G_h)$ requires determining
$d(G_h)$. Problem (3.13) can be simplified somewhat, however, by recognizing
that all group joinings in any partition must take place at the same level in
order to maximize $\sum_{h \in P_{ij}} E(G_h)$ for any partition $P_{ij}$. Thus,
problem (3.13) can be reduced to finding distances for any partition $P_{ij}$.
Unfortunately, not all groups will join in a given partition if there exists
an odd number of groups; thus, the distance (and expected distinctiveness)
for the unpaired group will extend into the next partition. For example,
consider the dendrogram of pairwise joinings in Figure 3.11 where $K_j = 3$,
$\ell(G_j) = 7$, and $n_j = 5$. In this case, the unpaired group cannot join
another group until level 5, thereby increasing $E(P_{ij})$ from 15 to 17 (the
maximum distinctiveness in this case).

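Therefore, given the distances $d(P_{ij})$ for partitions $P_{ij}$ (where
partitions are defined by group joinings), the maximum possible expected
distinctiveness value is defined as

$$\max_{i = 1, \ldots, K_j} \left( n_j\, d(P_{ij}) + c_{ij} \right)$$

where

$$c_{ij} = 2^{i-1}\, d(P_{i+1,j})\, \gamma_{ij} + 2^{i-2}\, d(P_{i-1,j})\, \gamma_{i-1,j}$$

and

$$\gamma_{ij} = \begin{cases} 1 & \text{if } \left\langle \dfrac{n_j}{2^{i-1}} \right\rangle \text{ is odd,} \\ 0 & \text{otherwise,} \end{cases}$$

and $\langle \cdot \rangle$ represents the smallest integer greater than or equal
to $(\cdot)$. (Calculations are represented in Table 3.4; for example, if
$\langle n_j/2 \rangle$ and $\langle n_j/4 \rangle$ are both odd, then
$E(P_{3j}) = n_j\, d(P_{3j}) + 4\, d(P_{4j}) + 2\, d(P_{2j})$.)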

TABLE 3.4

Minimum Expected Distinctiveness Calculation

PARTITION:          1         2          3          ...  i                ...  K_j

Distance:           d(P_1j)   d(P_2j)    d(P_3j)    ...  d(P_ij)          ...  d(P_Kj)
No. of groups:      n_j       <n_j/2>    <n_j/4>    ...  <n_j/2^(i-1)>    ...  <n_j/2^(K_j-1)>
No. of hospitals
in merged groups:   1         2          4          ...  2^(i-1)          ...  2^(K_j-1)


The problem of finding $\min_j$ then can be stated as

$$\min_{d(\cdot)} \; \max_{i = 1, \ldots, K_j} \left( n_j\, d(P_{ij}) + c_{ij} \right) \qquad (3.14)$$

$$\text{S.T.} \qquad \sum_{i=1}^{K_j} d(P_{ij}) = \ell(G_j)$$

$$d(P_{ij}) \ge 0$$

FIGURE 3.11

Binary Dendrogram Illustration

While problem (3.14) is difficult to solve in general, it was noted that in
most cases the optimal solution to problem (3.14) could be found by setting

$$d(P_{1j}) = \left\langle \frac{\ell(G_j)}{K_j} \right\rangle, \quad
d(P_{2j}) = \left\langle \frac{\ell(G_j) - d(P_{1j})}{K_j - 1} \right\rangle, \quad
d(P_{3j}) = \left\langle \frac{\ell(G_j) - d(P_{1j}) - d(P_{2j})}{K_j - 2} \right\rangle,$$

etc. In addition, the computer programs used to construct the dendrograms
(Anderberg, 1973) arbitrarily use 25 class levels, the upper bound for
$\ell(G_j)$. Thus, tables displaying values of $\min_j$ could be easily
constructed for values of $\ell(G_j)$ and various group sizes ($n_j$).

The  entire  algorithm  for  finding  P*  was  never  computerized  as  it  was  found  that 
the  search  procedure  effectively  allowed  manual  solutions  to  be  found  in  a  few 
minutes,  even  for  the  largest  data  set  (1,070  hospitals)  and  most  complex 
(i.e.,  sensitive)  composite  dendrograms. 

Not only can partitions be easily found and evaluated by this approach, but a
tradeoff between the number of groups and total expected distinctiveness can be
evaluated. To examine suboptimal partitions, it is simply necessary to find
which groups should be split or combined to make the smallest marginal decrease
in $E(P^*)$. This is easily accomplished by arbitrarily setting
$c_q = -\infty$ for each $q \in P^*$ and re-solving problem (SPP). A graph
showing the tradeoff between the number of groups and expected distinctiveness
for the example in Figure 3.8 is shown in Figure 3.12.


V.     Criteria  for  Cluster  Validation 


Testing the validity of the hospital partitions is essential if credibility is
to be placed on any reimbursement system derived from the resultant groups.
While there is no single test or procedure for measuring "goodness of fit" in
multivariate grouping problems, a number of approaches were adopted in this
study which allowed for a thorough examination of clustering results. The
approaches used here, which are briefly described below, fall into three major
categories: (1) descriptive statistics for examining the reasonableness,
intuitive consistency, and parsimony of the identified groups, (2) nonparametric
tests for statistical consistency, and (3) parametric tests based on
multivariate normal assumptions.


A.     Descriptive  Statistics 


Examination of the identified groups initially consisted of visual group
inspection, including nonquantifiable group characteristics such as state and county
names, types of hospitals, etc., and the calculation of the mean, standard
deviation, and measures of skewness and kurtosis for each independent
characteristic of the group. On the basis of the similarity measure used (the
squared  Euclidean  Distance),  several  statistics  were  found,  including  the  within 
group  diameters  (i.e.,  the  maximum  within  group  distance),  the  within  group  sum 
of  squared  distances,  and  the  distance  to  each  group  centroid. 
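
The distance-based statistics just listed can be sketched in a few lines. This is a minimal illustration, not the study's own program: the function names and the three-point toy group are hypothetical, and the "within group sum of squared distances" is assumed here to mean the sum over all within-group pairs.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points (the study's similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def group_summary(points):
    """Descriptive distance statistics for one group of hospitals (factor-score points)."""
    m = len(points[0])
    centroid = [sum(p[k] for p in points) / len(points) for k in range(m)]
    pair_d = [sq_dist(a, b) for a, b in combinations(points, 2)]
    return {
        "centroid": centroid,
        # Diameter: the maximum within-group (squared) distance.
        "diameter": max(pair_d, default=0.0),
        # Assumption: sum taken over all within-group pairs.
        "sum_sq_pairwise": sum(pair_d),
        "dist_to_centroid": [sq_dist(p, centroid) for p in points],
    }


# Hypothetical toy group of three hospitals described by two factor scores.
group = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
summary = group_summary(group)
```

Comparing diameters and within-group sums across groups gives a quick check on whether any group is unusually diffuse.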

B.     Nonparametric  Tests  for  Statistical  Consistency 

Most  proposed  tests  of  cluster  validation  may  themselves  be  classified  into  two 
categories:     (1)  tests  based  upon  results  from  various  clustering  schemes,  and 
(2)  tests  based  upon  statistics  calculated  from  multivariate  normal  assumptions. 
It should be noted that tests comparing results from various clustering schemes
are relative tests; i.e., they can only indicate differences among alternative
clustering techniques but not the direction of those differences. Tests
assuming that the data have been drawn from distinct populations of known
structure, on the other hand, can measure the absolute effectiveness of resultant
partitions. Both types of tests are profitable and were used in this study.

A  problem  frequently  encountered  throughout  this  study  concerned  measuring  the 
degree  of  similarity  (or  dissimilarity)  between  two  or  more  partitions.  For 
example,  it  was  necessary  to  compare  groups  of  hospitals  determined  by  our 
cluster  analytic  approach  with  groups  used  by  HCFA  which  had  been  determined  by 
multiple  cross  classification  (described  in  the  following  chapter). 

Recently,  Rand  (1971)  suggested  a  statistic  for  measuring  the  similarity  between 
partitions  of  hospitals  based  on  three  assumptions:     (1)  each  hospital  is  uniquely 
assigned  to  a  subset  or  group  (i.e.,  no  overlapping  groups  are  considered), 
(2)  it  is  equally  meaningful  for  a  partition  to  not  group  two  hospitals  together 
as  it  is  for  a  partition  to  group  the  hospitals,  and  (3)  all  hospitals  are 
equally  weighted  in  group  determination.     Furthermore,  Rand  demonstrated  several 
properties  and  simplified  computational  forms  for  the  statistic. 

Unfortunately, it is possible to show that Rand's suggested statistic, which is
equivalent to a simple matching coefficient used to describe similarity between
vectors  of  binary  values,  may  be  seriously  distorted  by  the  second  assumption 
above  when  relatively  large  numbers  of  nontrivial  groups  exist  in  a  partition. 
Therefore,  a  new  statistic  was  calculated  which  eliminated  any  distortion. 

Given two hospitals, each described by a vector of m dichotomous characteristics
$[x_{1i}, x_{2i}, \ldots, x_{mi}]$, where

$$x_{ci} = \begin{cases} 1 & \text{if the } i\text{th hospital contains the } c\text{th characteristic} \\ 0 & \text{otherwise,} \end{cases}$$

a number of measures (usually known as coefficients of association) can be
constructed which measure the overall similarity between the ith and the jth
hospital. One measure, $s(i,j)$, introduced by Sokal and Michener (1958) and
studied by Goodall (1967), is known as the simple matching coefficient and is
simply defined as the ratio of the number of matches to the total number of
characteristics; or, more precisely, if

$$\lambda_c = \begin{cases} 1 & \text{if } x_{ci} = x_{cj} \\ 0 & \text{otherwise} \end{cases} \tag{3.15}$$

then

$$s(i,j) = \frac{1}{m} \sum_{c=1}^{m} \lambda_c \tag{3.16}$$


This simple matching coefficient (3.16) is often challenged on the grounds that
mutually characteristic absences (i.e., zeroes in the characteristic vectors for
both hospitals) contribute to the measure of similarity. Therefore, an opposite
approach is to simply eliminate the counting of these mutually lacking
characteristics. This measure, $s'(i,j)$, known as the Jaccard coefficient, is more
precisely expressed by redefining $\lambda_c$, call it $\lambda'_c$, as

$$\lambda'_c = \begin{cases} 1 & \text{if } x_{ci} = 1 \text{ and } x_{cj} = 1 \\ 0 & \text{otherwise} \end{cases}$$

and defining $s'(i,j)$ as follows:

$$s'(i,j) = \frac{\sum_{c=1}^{m} \lambda'_c}{m - \left[ \sum_{c=1}^{m} \lambda_c - \sum_{c=1}^{m} \lambda'_c \right]} \tag{3.17}$$


(A myriad of additional measures exist on the continuum between these two
measures, which mostly use various schemes for weighting matches and mismatches.
For  a  description  of  fourteen  such  measures,  see  Anderberg,  1973.) 


Given n hospitals segregated by two partitions $P_i$ and $P_j$, measures describing
a degree of similarity between the two partitions can be constructed by
treating pairs of hospitals as characteristics and proceeding in a manner
analogous to the measures of the preceding section. Letting
$C = \{1, 2, \ldots, \binom{n}{2}\}$
be the set of all hospital pairs, a measure equivalent to the simple matching
coefficient can be derived by letting

$$x_{ck} = \begin{cases} 1 & \text{if partition } P_k \text{ groups the } c\text{th hospital pair together} \\ 0 & \text{otherwise,} \end{cases}$$

defining $\lambda_c$ by (3.15), and summing (3.16) over all hospital pairs; i.e.,

$$s(i,j) = \sum_{c \in C} \frac{\lambda_c}{\binom{n}{2}} \tag{3.18}$$



The simple coefficient (3.18) was proposed by Rand (1971); the Jaccard
coefficient equivalent for partitions can be defined in similar fashion using
$\lambda'_c$ and (3.17).

The simple matching coefficient has the disadvantage that when comparing two
partitions with relatively large numbers of (nontrivial) groups, its value will
tend to increase directly as a function of the number of groups, due to the
fact that hospital pairs tend to spread across groups and thus increase the
number of "nonmatches." Alternatively, consider the hospitals being randomly
partitioned. As the number of groups increases, the simple matching coefficient
fails to account for the increasing probability of two hospitals not being grouped
together. Conversely, the Jaccard equivalent would present exactly opposing
difficulties, as it would fail to consider the higher likelihood that two
hospitals will be grouped together when fewer groups exist. Thus, it would appear
that a measure, adjusted for the number of groups in each partition, should have
values between those of the simple matching coefficient and the Jaccard coefficient.

To develop a measure less dependent on the number of groups present, assume that
partition $P_i$ has $n_i$ groups and hospitals are grouped randomly. Then, the
probability that the cth pair of hospitals will be grouped together in partition
$P_i$ is simply $P_r(x_{ci} = 1) = 1/n_i$. The probabilities for the four possible cases
for the cth hospital pair are summarized in Table 3.5.


TABLE 3.5

Probabilities for Random Grouping of the cth Hospital Pair

                                        Partition P_i
Partition P_j        P_r(x_ci = 0)                 P_r(x_ci = 1)

P_r(x_cj = 0)        (1 - 1/n_i)(1 - 1/n_j)        (1/n_i)(1 - 1/n_j)
P_r(x_cj = 1)        (1 - 1/n_i)(1/n_j)            (1/n_i)(1/n_j)

Then, using the probabilities that these events do not occur randomly, an
undistorted statistic A is easily defined as follows:

$$A = \frac{\sum_{c \in C} \lambda_c \left[ 1 - P_r(x_{ci}) P_r(x_{cj}) \right]}{\sum_{c \in C} \left[ 1 - P_r(x_{ci}) P_r(x_{cj}) \right]} \tag{3.19}$$

where the probabilities shown in Table 3.5 are used as weights and the matching
parameter $\lambda_c$ is defined by (3.15).


With this assumption, the development of this statistic is conceptually
similar to a measure of association by Hyvarinen (1962).

To illustrate the calculation of A, assume that five hospitals are partitioned
in two ways, $P_1$ = {A} {BCD} {E} and $P_2$ = {AB} {CDE}. Examining all
possible hospital pairs, Table 3.6 indicates the values of $x_{ci}$, $\lambda_c$, and
$P_r(x_{ci})$ for each of the ten pairs.

TABLE 3.6

Numerical Illustration - Five Hospitals

Hospital    Partition P_1         Partition P_2         Weights
Pair        x_c1   P_r(x_c1)      x_c2   P_r(x_c2)      λ_c    1 - P_r(x_c1) P_r(x_c2)

AB          0      .67            1      .5             0      .667
AC          0      .67            0      .5             1      .667
AD          0      .67            0      .5             1      .667
AE          0      .67            0      .5             1      .667
BC          1      .33            0      .5             0      .834
BD          1      .33            0      .5             0      .834
BE          0      .67            0      .5             1      .667
CD          1      .33            1      .5             1      .834
CE          0      .67            1      .5             0      .667
DE          0      .67            1      .5             0      .667

In  this  example,  the  simple  matching  coefficient  (3.18)  gives  a  value  of  0.5 
and  the  Jaccard  coefficient  gives  a  value  of  0.167;  the  A  coefficient  equals 
0.488. 
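
The three values quoted for this example can be reproduced with a short script. The sketch below is only an illustration of equations (3.17) through (3.19) applied to partitions; the function names are hypothetical, and the probability of the observed grouping is taken as 1/n or 1 - 1/n, as in Table 3.5.

```python
from itertools import combinations


def pair_indicators(partition, items):
    """x_c for every hospital pair: 1 if the partition groups the pair together."""
    label = {h: g for g, members in enumerate(partition) for h in members}
    return [int(label[a] == label[b]) for a, b in combinations(items, 2)]


def compare_partitions(p_i, p_j):
    """Return (simple matching, Jaccard, A) for two partitions of the same hospitals."""
    items = sorted(h for group in p_i for h in group)
    x_i = pair_indicators(p_i, items)
    x_j = pair_indicators(p_j, items)
    n_i, n_j = len(p_i), len(p_j)
    matches = both = 0
    num = den = 0.0
    for a, b in zip(x_i, x_j):
        lam = int(a == b)                        # matching parameter, eq. (3.15)
        pr_i = 1 / n_i if a else 1 - 1 / n_i     # probability of the observed x_ci
        pr_j = 1 / n_j if b else 1 - 1 / n_j     # probability of the observed x_cj
        weight = 1 - pr_i * pr_j
        matches += lam
        both += a * b                            # pair grouped together in both partitions
        num += lam * weight
        den += weight
    n_pairs = len(x_i)
    simple = matches / n_pairs                   # eq. (3.18)
    jac_den = n_pairs - (matches - both)         # drop pairs separated in both partitions
    jaccard = both / jac_den if jac_den else 0.0
    return simple, jaccard, num / den            # last value is A, eq. (3.19)


p1 = [["A"], ["B", "C", "D"], ["E"]]
p2 = [["A", "B"], ["C", "D", "E"]]
simple, jaccard, a_stat = compare_partitions(p1, p2)
```

For these two partitions the script gives 0.5, 1/6 (about 0.167), and about 0.488, matching the values in the text.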

The measure A defined above varies from 0 (no pairwise matching) to 1 (complete
pairwise agreement between partitions). If all probabilities $P_r(x_{ci})$ are equal
(the case when only two groups are present), A reduces to the simple matching
coefficient.

In order to test the A statistic, twenty hospitals were randomly partitioned
into varying numbers of groups by randomly assigning each hospital a value
from 1 to K (where K is the number of groups). As the value of K increased,
the proportion of non-empty groups decreased substantially, as is evident in Table
3.7. After randomly determining two partitions, the three statistics indicated
were calculated for forty trials and averaged. The results are displayed in
Table 3.7; as expected, only the proposed statistic consistently indicates
that approximately one half of the hospitals are similarly grouped.

If groups are randomly determined, the expected value of A is 0.5. By
realizing that the expected number of OTU pairs in each event listed in Table
3.5 is simply $\binom{n}{2} P_r(x_{ci}) P_r(x_{cj})$, it is easily shown that the
expected value of A reduces to

$$E(A) = \frac{2 \binom{n}{2}}{4 \binom{n}{2}} = 0.5.$$

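
The claim can be checked numerically. The sketch below is not the authors' derivation: it computes the ratio of the expected numerator to the expected denominator of (3.19) under the random-grouping probabilities of Table 3.5, using exact fractions. The function name and the range of group counts tried are illustrative assumptions.

```python
from fractions import Fraction


def expected_ratio(n_i, n_j):
    """Expected numerator over expected denominator of A under random grouping."""
    p, q = Fraction(1, n_i), Fraction(1, n_j)
    # (probability of the event, matching parameter lambda_c) for the four cells
    # of Table 3.5.
    events = [
        ((1 - p) * (1 - q), 1),   # pair separated in both partitions
        (p * q, 1),               # pair grouped together in both partitions
        (p * (1 - q), 0),         # together in P_i only
        ((1 - p) * q, 0),         # together in P_j only
    ]
    # Each event contributes its probability times its weight (1 - probability).
    num = sum(pe * lam * (1 - pe) for pe, lam in events)
    den = sum(pe * (1 - pe) for pe, _ in events)
    return num / den


ratios = {(a, b): expected_ratio(a, b) for a in range(2, 8) for b in range(2, 8)}
```

Every entry of `ratios` equals exactly one half, whatever the two group counts, which is consistent with the simulated averages for A in Table 3.7.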

The  undistorted  statistic  developed  here  was  widely  used  in  this  study.  For 
example,  the  statistic  was  used  to  compare  optimal  and  suboptimal  partitions, 
partitions determined in this study and the partitions created in the SSA cross
classification  scheme,  partitions  created  by  discriminant  functions,  etc. 


C.     Parametric  Tests  for  Statistical  Validity 


The preceding sections have alluded to the use of parametric tests for
examining the validity of resultant clusters. If it is assumed that a given
number of populations exists and it is the task of the clustering procedures
to correctly identify those populations, strong tests of significance can be
developed for measuring absolute effectiveness of a clustering result.

Parametric  tests  are  based  on  the  assumption  (supported  for  large  samples 
by  the  Central  Limit  Theorem)  that  elements  in  each  population  are  distributed 
by  a  multivariate  normal  density  function.     Accepting  this  assumption,  three 
approaches were used to test the resultant clusters: (1) one-way analysis of
variance (ANOVA), (2) linear discriminant analysis, and (3) regression analysis.


C.l.     Analysis  of  Variance 

In order to test resultant clusters, a one-way analysis of variance was used
to test for significant differences between all identified groups for all
variables (i.e., factor scores). That is, if seven factors were used to describe
each hospital, seven one-way analyses of variance were run using each factor as
the dependent variable and a variable indicating group as the independent
variable. Based on the sum of squared deviations between groups and the sum of
squared deviations not explained by the independent variable, an F ratio was
calculated and used to test if significant differences exist between groups for
each factor used. For these tests (as well as other parametric tests), a
significance level of 0.05 was used.
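
The F ratio described above can be computed directly from the two sums of squares. The following is a minimal sketch for a single factor; the function name and the toy factor scores are hypothetical, and the resulting F would still have to be compared with a tabled critical value at the 0.05 level.

```python
def one_way_f(groups):
    """One-way ANOVA F ratio for one factor; `groups` is a list of lists of scores."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Sum of squared deviations between groups (explained by group membership).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Sum of squared deviations within groups (not explained by group membership).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical factor scores for three clusters of three hospitals each.
clusters = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio, df_between, df_within = one_way_f(clusters)
```

With seven factors, the same function would simply be applied seven times, once per factor.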
TABLE 3.7

Comparisons of 20 Randomly Partitioned Hospitals*

Number       Average Number of    Simple Matching    Jaccard
of Groups    Nonempty Groups      Coefficient        Coefficient      A

 2            2.00                .495               .337             .495
 3            3.00                .552               .194             .496
 4            4.00                .619               .139             .494
 5            4.95                .679               .121             .502
 6            5.84                .716               .088             .493
 7            6.67                .759               .073             .505
 8            7.45                .782               .069             .504
 9            8.17                .804               .051             .502
10            8.95                .826               .051             .513
11            9.42                .834               .051             .504
12            9.95                .849               .053             .510
13           10.46                .863               .043             .513
14           10.84                .868               .029             .503
15           11.21                .879               .040             .513
16           11.67                .886               .029             .509
17           12.12                .893               .027             .514
18           12.27                .897               .027             .510
19           12.69                .902               .030             .509
20           12.55                .901               .025             .494

*All results are based on simulation runs of 40 trials for each target group
size.

C.2.     Discriminant  Analysis

Discriminant  analysis,  previously  mentioned  in  the  first  chapter,  assumes  that 
groups  have  been  identified  and  attempts  to  construct  linear  functions  which 
explain  the  differences  between  groups.     Thus,  having  found  clusters,  we  assumed 
that  each  cluster  identified  distinct  populations  normally  distributed  about 
their  respective  cluster  centroids,  and  employed  linear  discriminant  analysis  to 
find  these  functions  and  reclassify  the  hospitals.     In  this  case,  information 
on the accuracy of the discriminant functions is provided by the number of
hospitals which are correctly reclassified by the functions, and Wilks' lambda,
which indicates whether or not the functions are statistically significant.

If  significant,  discriminant  analysis  offers  one  means  for  reclassifying 
uniquely grouped hospitals (i.e., isolates). Using the discriminant functions
to reclassify isolates in this case is equivalent to computing the squared
Euclidean distance to the centroid of the groups determined by the discriminant
analysis, computing each isolate's respective chi squared value from this
distance, finding the probability of this chi squared value (from a chi square
table), where this quantity determines the proportion of hospitals which lie
as close to or closer to the cluster centroid, and assigning the isolate to
the group which maximizes this probability of membership.
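
A simplified sketch of this reassignment step follows. It rests on two assumptions that are not stated in the text: that the membership probability is the upper-tail chi squared probability (one minus the proportion of hospitals lying as close or closer), and that every group uses the same degrees of freedom. Under those assumptions the probability falls as the squared distance grows, so maximizing it is the same as choosing the nearest centroid. The names and coordinates below are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points in factor-score space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_isolate(isolate, centroids):
    """Assign an isolate to the group whose centroid is nearest.

    Assumption: with equal degrees of freedom, the upper-tail chi squared
    probability decreases with distance, so the nearest centroid also
    maximizes the probability of membership.
    """
    distances = {name: sq_dist(isolate, c) for name, c in centroids.items()}
    best = min(distances, key=distances.get)
    return best, distances


# Hypothetical group centroids and one isolate, each described by two factor scores.
centroids = {"G1": (0.0, 0.0), "G2": (3.0, 3.0)}
group, distances = assign_isolate((1.0, 0.5), centroids)
```

Keeping the full dictionary of distances makes it easy to flag isolates that are far from every centroid and may be better left ungrouped.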

C.3.     Regression  Analysis 

If  we  assume  that  variations  in  cost  per  case  are  a  function  of  variation  in 
both  the  economically  "legitimate"  variables  (identified  in  Chapter  Two)  which 
were used to determine the clusters and "non-legitimate" or inefficiency
variables (e.g., number of beds, number of RN's and LPN's, etc.), and if clusters
reduce the amount of variation in legitimate variables (as tested by the ANOVA),
then it has been argued that it should be possible to detect an appropriate
reduction in cost per case.43 One means for testing this hypothesis is provided
by regression analysis; i.e., using cost per case as the dependent variable and
both legitimate and inefficiency variables as the independent variables, find
the appropriate regression equations for each identified cluster. If the
hypothesis stated above is true, then the regression coefficients for the inefficiency
variables should explain the majority of variation in cost per case and the
legitimate variables should have little or no significant effect (for each cluster).

One  would  expect  to  find  this  difference  between  regression  coefficients,  however, 
only  if  the  following  two  assumptions  are  valid:     (1)  the  functions  relating 
cost  per  case  to  the  independent  variables  must  be  linear,  and  (2)  the  points 
representing  hospitals  must  in  fact  approach  a  multivariate  normal  swarm 
around each cluster centroid. The uncertainty associated with the first
assumption was a major factor contributing to the selection of cluster analysis, as
discussed in the first chapter. The second assumption is basic for any
parametric test, but again there is no a priori reason to expect this to hold,
especially in clusters with relatively fewer hospitals. The importance of this
assumption  can  be  illustrated  by  the  example  represented  in  Figure  3.4;  assume 
that in a final partition, hospitals H~ and H    in this example, are grouped
together (a reasonable assumption as the distance between the points
representing these two hospitals is relatively small). However, if one determines a
regression equation using income (I) as well as inefficiency variables as
independent variables, it is entirely possible that the income variable could
explain 100 percent of the variation in cost per case between the two hospitals.
While this example of two hospitals is, admittedly, extreme, it underscores the
importance of assuming clusters represent multivariate normal swarms around
each cluster centroid.

43 For further discussion of this point, see Phillip and Iyer (1972).

In addition, it is assumed that variation in the cost per case (dependent
variable) is not dominated by the inefficiency variable. If this were to occur,
one would still detect significant differences between regression coefficients
and coefficients of determination within clusters, but would not be able to
detect any significant differences between clusters.

Therefore, while the results of such regression analyses are reported in the
following chapter, it is exceedingly important that these results be most
cautiously interpreted and carefully used if no significant differences are found.

CHAPTER FOUR

EMPIRICAL  RESULTS 
I.  Introduction 

In  order  to  illustrate  the  economic  framework  and  test  the  feasibility  of  the 
cluster  analysis  methodology,  the  concepts  developed  in  Chapters  Two  and  Three 
were  applied  to  a  1973  data  base  compiled  by  the  Office  of  Policy,  Planning, 
and Research, Health Care Financing Administration (HCFA). The data set
contains information gathered from Medicare cost reports for 1973, Census Bureau
reports  (1970  census),  the  American  Hospital  Association  (AHA)  annual  survey 
of  hospitals  for  1973,  and  a  variety  of  demographic  and  socio-economic  data 
files  for  1097  short-term,  general  U.  S.  hospitals. 

From the outset, it was clear that the nature of the available data set
precluded anything other than a preliminary test of the feasibility of our
classification model. Direct measures for a number of the variables
identified in Chapter Two were not available (e.g., case mix and case mix severity),
and  there  were  no  acceptable  means  of  evaluating  the  bias  introduced  by 
substituting  the  surrogate  measures  that  were  available.     Thus,  the  empirical 
results  reported  here  must  be  viewed  as  suggestive  rather  than  conclusive. 
They  are  of  value  only  in  that  they  provide  useful  and  important  information 
about  the  feasibility  of  the  approach,  potential  problem  areas,  and  directions 
for  further  investigation. 

Given  the  problems  of  data  availability,   the  first  step  in  the  empirical  analysis 
was  to  select  measures  that  would  be  used  to  reflect  the  impact  of  the  variables 
identified  in  Chapter  Two:     input  prices,  rural  markets,  case  mix,  and  case  mix 
severity.     The  measures  used  for  the  input  prices  and  rural  markets  variables 
are  described  in  Chapter  Two.     Ideally,  diagnostic  data  would  be  used  for  the 
latter  two  variables. 

However,  as  indicated  above,  no  direct  diagnostic  data  were  available.     In  their 
place,  two  alternative  approaches  were  used.     In  the  first,  the  effects  of 
variations in case mix were represented by a number of measures based on
endogenous characteristics of the individual hospitals. This approach was labeled
the  endogenous  approach.     In  the  second,  a  number  of  measures  reflecting  the 
socioeconomic  character  of  the  county  in  which  the  hospital  is  located  were  used. 
Since  these  measures  are  exogenous  to  the  individual  hospital,   this  method  was 
called  the  exogenous  approach.     The  impact  of  differences  in  case  mix  severity 
was  captured  by  adding  a  number  of  demographic  variables  in  the  endogenous 
approach.     No  further  variables  were  needed  for  case  mix  severity  in  the  exogenous 
approach.     A  complete  listing  of  the  two  sets  of  measures  is  given  in  Table  2.3. 

For  comparison,  a  third  approach  was  also  used.  The  measures  in  this  approach 
were  based  on  the  cross  classification  system  previously  used  by  HCFA  to  group 
hospitals  that  provide  services  to  Medicare  beneficiaries.  The  HCFA  variables 
consisted of the following: (1) median family income in the state, (2) bed
size of the hospital (at the end of 1973), and (3) a dummy variable indicating
whether  or  not  the  hospital  is  located  in  an  SMSA. 

In  order  to  remove  implicit  weights  caused  by  multicollinearities  among  the 
measures,  all  sets  of  measures  were  first  factor  analyzed  and  subsequently 
represented  only  by  factor  scores  (as  discussed  in  the  third  chapter).  Two 
types  of  explicit  weighting  schemes  were  then  applied  to  the  three  sets  of 
factor  scores.     The  first  type  was  based  on  the  standardized  linear  regression 
coefficients  associated  with  the  factors,  estimated  using  cost  per  case  as  the 
dependent  variable.     The  second  set  of  weights  was  determined  in  such  a  way  that 
each  variable  was  given  equal  (unit)  weight.     These  steps  are  discussed  in  more 
detail  below. 

In  order  to  refine  the  methodology  and  test  the  feasibility  of  using  a  small 
sample  to  estimate  groups  for  the  entire  population,  a  random  subsample  of  194 
hospitals  was  selected.     The  smaller  sample  size  made  it  possible  to  easily 
analyze all sets of measures and weights, consider a number of analytic
possibilities, and fully specify composite dendrograms. Following analysis of the
subsample,  the  complete  sample  was  analyzed  and  the  results  were  compared. 

A.     Data  Base  Description 

The  data  base  was  initially  examined  for  any  missing  and/or  questionable  values; 
missing  values  were  identified  and  replaced  from  secondary  data  sources  whenever 
possible.     Questionable  values  were  defined  as  those  values  lying  outside  a  three 
standard deviation interval about the variable mean. Any hospital having
questionable values which could not be verified was deleted from the sample.

Of  the  original  1097  hospital  sample  provided  by  HCFA,  a  verified  sample  of 
1070  hospitals  remained  after  screening.     Of  the  hospitals  remaining,  86  percent 
were  accredited  by  the  JCAH,  29  percent  of  the  hospitals  had  some  kind  of  medical 
school  affiliation,  and  60  percent  were  located  within  an  SMSA.     The  average 
bed size was 282.4 beds, with a standard deviation of 241 beds and a skewness
of +1.26 (indicating that the mode or most frequently occurring value is less
than the average value). The measures used to represent the relevant
classification variables and their mean values among sample hospitals are given below.

As indicated in the second chapter, input factor prices are represented by three
county based wage measures and one hospital based wage measure. These measures
and their sample means are as follows:

Manufacturing wages (hourly) ............................ $4.16
Transportation & public utilities wages (hourly) ........ $4.63
Retail wages (hourly) ................................... $2.34
Hospital wages (annual) ................................. $7365.28

The HCFA system places a hospital into a given group based on its number
of beds (0-54 beds, 55-94 beds, 100-169 beds, 170-264 beds, 265-404 beds,
405-684 beds, and over 685 beds), whether or not the hospital is in an SMSA,
and a state-based per capita income category. The number of variable categories
used (7 for bed size, 2 for SMSA/non-SMSA, and 5 for per capita income) results
in 70 possible cross classification cells for the HCFA system, which is supposed
to represent all reasonable combinations of hospital size and economic environment.


The  rural  markets  variable  is  represented  by  the  uniform  pressure  occupancy 
index  (UPOI)  described  in  section  IV,  part  D,   Chapter  Two.     The  mean  value  of 
this  index  for  the  complete  hospital  sample  is  89  percent. 

The endogenous approach uses a number of hospital specific measures to
represent case mix and case mix severity. These case mix surrogate measures and their
mean  values  are  listed  below. 


Number of basic services (maximum value of 4) ................. 3.9
Number of quality enhancing services (maximum value of 7) ..... 4.6
Number of complex services (maximum value of 16) .............. 7.5
Number of community services (maximum value of 17) ............ 4.4
Number of births per discharge ................................ 0.095
Number of surgical operations per discharge ................... 0.19
Number of outpatient visits per discharge ..................... 4.97


To  measure  case  mix  severity,  the  percentage  of  the  county's  population  under 
5  years  old  and  the  percentage  of  the  population  over  65  years  were  combined 
under  the  assumption  that  those  two  age  groups  produce  the  most  severe  cases. 
A  second  severity  measure,  the  percentage  of  families  earning  less  than  $4,000 
yearly,  assumes  that  poverty  families  are  more  likely  to  be  treated  for  more 
severe  cases.     The  mean  values  for  these  measures  are  as  follows: 

Percentage of population under 5 and over 65 .................. 21%
Percentage of families earning less than $4000 yearly ......... 17%

The  exogenous  approach  used  a  number  of  county  specific  measures  to  represent 
case  mix  and  case  mix  severity.     One  of  these  measures,  "heavy  demand  ages," 
was determined by adding the percentage of the population under 5, the
percentage of population over 65, and the percentage of females from 15 to 44 years
old.     The  mean  values  for  all  exogenous  measures  are  listed  below. 


Heavy demand ages ............................................. 41.3%
Percentage of families earning less than $4000 yearly ......... 17.0%
Family income (yearly) ........................................ $9146.29
Labor force participation rate ................................ 56.3%
Disabled rate ages 16-44 ...................................... 10.1%
Percentage of population non-white ............................ 10.9%
Percentage of M.D.'s aged 60 and over ......................... 20.9%
Obstetricians/Gynecologists per 10,000 population ............. 0.0833
Primary care M.D.'s per 10,000 population ..................... 0.8134
Medical specialists per 10,000 population ..................... 3.1877
Direct care specialists per 10,000 population ................. 0.2699
Other specialists per 10,000 population ....................... 0.2196
Surgical specialists per 10,000 population .................... 0.4062

The three variables used in the third approach based on the HCFA cross
classification system were median family income by county, total number of hospital
beds, and a dichotomous zero-one variable indicating whether or not the
hospital was located within an SMSA. While the HCFA system uses statewide
average family income, such data were not available in this study.
Hence, the county-based measure was substituted. The three measures and their
mean values are given below.

Median family income (by county) .............................. $9146.30
Total number of hospital beds ................................. 282.40
Percentage of hospitals within an SMSA ........................ 60.40%

As  part  of  the  initial  examination  of  the  data  base,   the  simple  product  moment 
correlation  coefficient  between  all  pairs  of  measures  was  computed.  The 
correlation coefficient is of interest since it indicates the degree of
multicollinearity and hence the implicit weighting present among the measures. We found,
for example, that many of the proposed case mix measures were highly correlated
with input price measures. Failure to remove this multicollinearity would result
in implicit increases in the weights on both the input price and case mix
variables. Thus, it became imperative to factor analyze the measures in order to
attempt to isolate the parts of the measures corresponding to each of the
dimensions.

A subset of these correlation coefficients is given in Table 4.1; the complete
correlation  matrices  are  given  in  Appendix  B.     Due  to  the  large  sample  size, 
coefficients  with  absolute  values  as  low  as  0.20  are  statistically  significant. 
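
The significance statement can be illustrated with the usual t transformation of a correlation coefficient, $t = r \sqrt{n - 2} / \sqrt{1 - r^2}$. The sketch below assumes the verified sample of n = 1070 hospitals and, as a simplification, uses the large-sample two-sided critical value 1.96 in place of an exact t table; the function names are hypothetical.

```python
import math


def t_for_r(r, n):
    """t statistic for testing that a product moment correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def min_significant_r(n, t_crit=1.96):
    """Smallest |r| reaching significance, assuming a normal-approximation critical value."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


t_value = t_for_r(0.20, 1070)   # t statistic for r = 0.20 with 1070 hospitals
r_min = min_significant_r(1070)  # approximate threshold for |r| at the 0.05 level
```

Under these assumptions a coefficient of 0.20 lies far beyond the critical value, and coefficients well below 0.20 would also be significant at the 0.05 level, so significance alone says little about how strongly two measures overlap.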

B.     Factor  Analysis 

Given  the  relatively  high  correlations  found  among  measures,  it  was  necessary  to 
factor  analyze  each  set  of  measures  in  order  to  eliminate  the  implicit  variable 
weighting caused by these multicollinearities. In essence, the factor analysis
simply  reduced  each  data  set  to  strictly  orthogonal  measures. 

Each  factor  analysis  was  computed  using  varimax  rotation  to  simplify  the  rows 
(i.e.,  variable  loadings)  as  much  as  possible,  and  an  eigenvalue  limit  of  1.0 
(the default value) was used.46 In all cases, this limit resulted in factors
which  accounted  for  a  minimum  of  80  percent  of  the  total  data  set  variation. 

The  first  set  of  measures  to  be  factor  analyzed  (the  endogenous  approach  measures) 
resulted  in  six  factors  which  accounted  for  80.3  percent  of  the  total  variation. 


Note  that  in  the  unlikely  case  that  all  measures  are  completely  orthogonal, 
there  would  exist  as  many  factors  as  measures  and  the  resultant  similarity 
scores  would  be  the  same  if  computed  from  factor  scores  or  from  the  standardized 
measures  themselves. 


The  SPSS  (Statistical  Package  for  the  Social  Sciences)  program  was  used 
for  the  regression  and  factor  analyses. 

TABLE 4.1

Product-Moment Correlation Coefficients

Endogenous Measures

Manufacturing wages with percentage of families
  earning less than $4,000 yearly ................................ -0.69
Transportation & public utilities wages with
  percentage of families earning less than $4,000 ................ -0.61
Retail wages with percentage of families earning
  less than $4,000 yearly ........................................ -0.56
Transportation & public utilities wages with
  complex services ............................................... +0.50
Percentage of families earning less than $4,000
  yearly with ratio of operations/discharges ..................... -0.53

Exogenous Measures

Manufacturing wages with percentage of families
  earning less than $4,000 yearly ................................ -0.69
Manufacturing wages with family income ........................... +0.73
Transportation & public utilities wages with
  percentage of families earning less than $4,000 ................ -0.61
Transportation & public utilities wages with
  family income .................................................. +0.69
Transportation & public utilities wages with
  Ob/Gyn's per 10,000 population ................................. +0.53
Retail wages with percentage of families earning
  less than $4,000 yearly ........................................ -0.56
Retail wages with Ob/Gyn's per 10,000 population ................. +0.65
Retail wages with direct care specialists per
  10,000 population .............................................. +0.52
Retail wages with other specialists per 10,000
  population ..................................................... +0.53
Retail wages with surgical specialists per
  10,000 population .............................................. +0.50
Hospital wages with family income ................................ +0.56

HCFA Measures

Family income with total number of hospital beds .................  0.42
Family income with percentage within an SMSA .....................  0.66
Total number of hospital beds with percentage within an SMSA .....  0.56

The factors found in this case appeared to be well defined; each measure tended
to load highly on only one factor. Therefore, it was relatively straightforward
to interpret each factor. The first factor appeared to be the input price
factor, the third factor appeared to be a case mix severity factor, and the
other factors appeared to measure case mix. Further interpretation is provided
below.


Factor 1:  The four input price measures, hospital wages (.59), manufacturing
           wages (.63), transportation and public utilities hourly wages (.77)
           and retail hourly wages (.74), have their highest loadings on this
           factor. One measure, the ratio of surgical operations to discharges
           (.48), which was hypothesized to represent case mix, also loads
           highest on this factor, although it loads highly on factors 2, 3, and
           4 (.38, .25, and .25) as well (thereby providing considerable input
           to the two case mix factors). Another measure, the percentage of
           families earning less than $4,000, also loaded highest on this
           factor (-.60). Its loading on the third factor, which apparently
           was the case mix severity factor, was -.56.

Factor 2:  The three service measures, the number of quality enhancing services
           (.55), the number of complex services (.83), and the number of basic
           services (.68), all load highest on this case mix factor.

Factor 3:  The measure, high demand ages, has its highest loading (-.55) on
           this case mix severity factor; in addition, the other hypothesized
           case mix severity measure, percentage of families earning less
           than $4,000 annually, has its second highest loading (-.56) on
           this factor.

Factor 4:  The measure of the number of basic services loads the highest on
           this factor (.49), and the number of quality enhancing services has
           its second highest loading (.53) on this case mix factor.

Factor 5:  The ratio of outpatient visits to discharges has its highest loading
           (.50) and the number of community services has its second highest
           loading (.49) on this additional case mix factor.

Factor 6:  The ratio of births to discharges has its only sizeable loading (.36)
           on this fourth and final case mix factor.


The set of exogenous measures was also subjected to factor analysis, using the
same eigenvalue limit (1.0) and varimax rotation. In this case, five factors
were found which accounted for 83.1 percent of the total variation. Three
factors were apparently case mix factors (factors 1, 3, and 4), the second
factor was apparently the input price factor, and one factor (the fifth) was
difficult to interpret and was therefore discarded, since only the measure of
the labor force participation rate loaded highly on it. Individual factor
interpretations are given below.


The  numbers  in  parentheses  indicate  factor  loadings. 
Factor 1:  Every measure of the supply of medical personnel (i.e., Ob/Gyn's
           per 10,000 population (.81), primary care M.D.'s per 10,000
           population (.95), medical specialists per 10,000 population (.95),
           specialists per 10,000 population (.95), and surgical specialists
           per 10,000 population (.94)) loaded very highly on this factor,
           clearly identifying it as a case mix factor.

Factor 2:  The four input price measures load highest on this factor, with
           the three county based wage rates having a minimum loading of
           0.74 and the hospital based wage rate having a loading of 0.57.
           In addition, the percentage of families earning less than $4,000
           annually (-.80) as well as family income (.87) load highest on
           this factor. The measures for the labor force participation
           rate and the disabled rate have their second highest loadings
           on this factor (.53 and -.39, respectively), supporting the
           assumption that this factor is extracting the "prices" variance in
           all measures.

Factor 3:  One measure, the percentage of non-whites (.73), has its highest
           loading on this factor, and another measure, the disabled rate, has
           its second highest loading (.35) on this additional case mix factor.

Factor 4:  This factor defines the third case mix factor. The measure of
           high demand ages (including the percentage of females aged 15 to
           44) (.45), the disabled rate (.42), and the percentage of M.D.'s
           older than 60 (.27) have their highest loadings on this factor.

In addition to the two sets of case mix and input price measures, the measure
for the rural markets variable (i.e., the uniform pressure occupancy index)
was included in a preliminary factor analysis. Not surprisingly, it was found
that this measure always loaded highest on the input price factor, since the
variation in rural-urban differences is highly associated with price differences.
However, since the uniform pressure occupancy index is the only measure for
rural markets, it was not necessary to include this measure in the subsequent
factor analyses. The standardized values of the uniform pressure occupancy
index, comparable to the standardized factor scores, were therefore treated
as an additional score to be included in the computation of distance measures.
On balance, the conceptual advantage of having a representative measure for
rural markets was believed to outweigh the effects of the implicit weighting
resulting from any collinearity between the uniform pressure occupancy index
and input prices.

Factor analysis was also performed on the variables used by HCFA in their cross
classification system. Two factors were found which accounted for 100 percent
of the total variation. The family income measure loaded highest on one factor,
and the number of total beds loaded highest on the second factor. The
dichotomous variable indicating the location of a hospital within an SMSA loaded
approximately equally on both factors (0.65 and 0.61, respectively).

C.     Variable Weight Selection

The use of factor scores in place of the actual variable measures removed any
implicit weightings due to multicollinearities. Two explicit weighting schemes
were then used to define the relative importance of each variable.

The first weighting scheme attempted to equalize the weights across the
hypothesized variables (i.e., the input price variable, the rural markets
variable, etc.) under the assumption that there is no a priori reason to
consider one variable more important than any other in the determination of
hospital costs. Given the use of factor scores, however, and the fact that more
than one factor had been identified for some variables (e.g., case mix), it was
necessary to reduce the weight of some factors below unity. If, for example, a
variable had several clearly associated factors, each associated factor was
assigned an equal proportion of the unit weight. The factor weights determined
in this manner are displayed in Tables 4.2, 4.3, and 4.4.
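
The unit weighting rule can be sketched as follows. The factor labels are those
printed for the endogenous set in Table 4.15; the function and dictionary names
are illustrative assumptions rather than anything taken from the study's
software.

```python
from collections import Counter

def unit_weights(factor_to_variable):
    """Give each hypothesized variable a unit weight, split equally among its factors."""
    counts = Counter(factor_to_variable.values())
    return {factor: 1.0 / counts[variable]
            for factor, variable in factor_to_variable.items()}

# Assignment of endogenous factors to hypothesized variables (labels from Table 4.15).
endogenous = {
    "Factor 1": "prices",
    "Factor 2": "case mix",
    "Factor 3": "severity",
    "Factor 4": "case mix",
    "Factor 5": "case mix",
    "Factor 6": "case mix",
    "Factor 7": "rural markets (UPOI)",
}
endogenous_weights = unit_weights(endogenous)
```

Under this rule the four case mix factors each receive a weight of 0.25 and the
remaining factors a weight of 1.00, which agrees with the weights shown in
Table 4.15.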

The second weighting scheme relaxed the assumption that all variables
contribute equally to hospital costs, and attempted to determine each variable's
relative importance by regressing cost per case on the factor scores and the
uniform pressure occupancy index. If a factor was statistically significant
in the regression equation, the standardized regression coefficient for that
factor was used as its weight. A weight of zero was assigned to factors
whose coefficients were not significantly different from zero. The regression
weights found for the three sets of variables are shown in Table 4.5. It is
easily seen that these weights differ substantially from those developed by
the unit weighting scheme.
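
A minimal sketch of the regression weighting rule follows. It rests on two
assumptions that are not taken from the report: the predictors are treated as
mutually uncorrelated (true of orthogonal factor scores, only approximately
true of the uniform pressure occupancy index), so that each standardized
coefficient equals the simple correlation with cost per case; and significance
is judged with a large-sample cutoff of |t| > 1.96, since the report does not
state the level used. The data are constructed toy values.

```python
import math

def pearson(x, y):
    """Simple product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_weights(factors, cost, t_crit=1.96):
    """Standardized coefficients for uncorrelated predictors; zero if not significant."""
    n, k = len(cost), len(factors)
    betas = [pearson(f, cost) for f in factors]      # valid only if predictors are uncorrelated
    r_squared = sum(b * b for b in betas)
    std_error = math.sqrt((1.0 - r_squared) / (n - k - 1))
    return [b if abs(b / std_error) > t_crit else 0.0 for b in betas]

# Toy data: three mutually orthogonal "factor scores" and a cost that depends on two of them.
f1 = [1, 1, 1, 1, -1, -1, -1, -1]
f2 = [1, 1, -1, -1, 1, 1, -1, -1]
f3 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [1, -1, -1, 1, -1, 1, 1, -1]
cost = [3 * a + b + 0.5 * e for a, b, e in zip(f1, f2, noise)]
toy_weights = regression_weights([f1, f2, f3], cost)
```

In the toy data the third factor has no relation to cost, so it receives a
weight of zero, while the first two keep their standardized coefficients.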

The  use  of  the  two  weighting  systems  and  the  three  sets  of  measures  produced 
a  total  of  six  data  sets  to  be  analyzed.     These  six  sets  and  their  respective 
mnemonics  are  as  follows: 

1)  Endogenous variables with unit weights (EnU)
2)  Exogenous variables with unit weights (ExU)
3)  HCFA variables with unit weights (HCFAU)
4)  Endogenous variables with regression weights (EnR)
5)  Exogenous variables with regression weights (ExR)
6)  HCFA variables with regression weights (HCFAR).


The use of cost per case is an admittedly inadequate and gross representation
of cost, but it was used as the best alternative of the data available.

Numerous other explicit weighting schemes exist. For example, weights
for any factor might have been determined by the percentage of explained
variance accounted for by that factor in the factor analysis. Other
possibilities include the use of the Automatic Interaction Detector (AID)
(Morgan and Sonquist, 1973) to eliminate the dependency on a predefined
functional form, or an iterative scheme which re-estimates the weights as each
group of homogeneous hospitals is found. In future work, a number of these
approaches will be explored.

II.     Cluster Analysis Results:     Subsample of 194 Hospitals

In order to test the feasibility of the methodology and to test sample versus
population results, a random subsample of 194 hospitals was selected from
the complete data base of 1070 hospitals. On the basis of the sampling
procedure, as well as subsequent examination of the subsample, it appeared that
the subsample accurately mirrored the complete sample.

Based on the three sets of measures and two sets of weights, six sets of
similarity measures were calculated. In each set, the similarity measure
between two hospitals was the squared Euclidean distance calculated from the
weighted factor scores. As described in the third chapter, six clustering
algorithms (complete linkage, average linkage between groups, average linkage
within groups, median method of Gower, centroid method, and Ward's
suboptimization method) were then applied to each similarity matrix. The outputs
from these algorithms were combined to produce a single composite dendrogram for
each of the six variable sets. These composite dendrograms were then analyzed
to find reasonable sets of hospital partitions.
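
The similarity measure can be sketched as below. The sketch assumes that each
factor score is multiplied by its weight before differencing, which is one
reading of "weighted factor scores"; the scores and weights shown are
hypothetical.

```python
def weighted_squared_distance(scores_a, scores_b, weights):
    """Squared Euclidean distance between two hospitals' weighted factor scores."""
    return sum((w * a - w * b) ** 2 for a, b, w in zip(scores_a, scores_b, weights))

def similarity_matrix(score_rows, weights):
    """Symmetric matrix of pairwise squared distances, one row per hospital."""
    n = len(score_rows)
    return [[weighted_squared_distance(score_rows[i], score_rows[j], weights)
             for j in range(n)] for i in range(n)]

# Three hypothetical hospitals scored on two factors, with weights 1.00 and 0.25.
scores = [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
distances = similarity_matrix(scores, [1.0, 0.25])
```

Here the first two hospitals differ on both factors and are 1.25 apart, while
the first and third differ only on the lightly weighted factor and are 0.25
apart.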

A composite similarity measure between two hospitals represents the (weighted)
average joining class level at which those two hospitals were joined by the
six individual algorithms. In order to compute such an average, each algorithm's
results are weighted by the square of its respective absolute cophenetic
correlation coefficient (i.e., the correlation coefficient between the vector
of pairwise similarity measures and the vector of joining class levels). The
absolute cophenetic correlation coefficients for the six variable sets and the
six individual algorithms are shown in Table 4.6, and empirically support the
use of a composite dendrogram. An examination of Table 4.6 reveals that no
single method consistently outperforms the others (the maximum cophenetic
correlation coefficient for each data set is indicated by a "*"). Of the six
methods used, three attain the maximum value for at least one data set.
On the other hand, it is readily apparent that in most cases the differences
between methods, at least in terms of the cophenetic correlation coefficient,
are not significant. On the basis of the results in Table 4.6,
it would be difficult to justify selecting the results from one algorithm
with a correlation coefficient of, say, 0.74 while completely neglecting the
results from another algorithm with a correlation coefficient of 0.73 (which
would be the case for the EnU approach).
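
The weighted average described above can be sketched as follows. The sketch
assumes that each algorithm's joining class levels are already expressed on a
comparable scale and are supplied as a square matrix; the matrices are
hypothetical, and only the coefficients 0.74 and 0.73 echo the example in the
text.

```python
def composite_levels(joining_levels, cophenetic_coeffs):
    """Average the joining levels across algorithms, weighting each by its coefficient squared."""
    weights = [c * c for c in cophenetic_coeffs]
    total = sum(weights)
    n = len(joining_levels[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, joining_levels)) / total
             for j in range(n)] for i in range(n)]

# Hypothetical joining levels for three hospitals from two algorithms.
levels_a = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
levels_b = [[0, 2, 2], [2, 0, 1], [2, 1, 0]]
composite = composite_levels([levels_a, levels_b], [0.74, 0.73])
```

Because the two coefficients are nearly equal, the composite level for any pair
of hospitals falls close to the midpoint of the two algorithms' levels, so
neither algorithm's result is discarded.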

Given the six weighted similarity matrices, the composite dendrograms
were found for all six data sets: EnR, ExR, HCFAR, EnU, ExU, and HCFAU.
These composite dendrograms and their respective group structures are
presented in Figures 4.1 to 4.6.

Each composite dendrogram was analyzed using the concept of expected
distinctiveness defined in the third chapter. The results of these analyses,
given in Table 4.7 (for unit weights) and Table 4.8 (for regression weights),
summarize the optimal and suboptimal group structures as determined by the
values of expected distinctiveness. The number following the partition
identification (i.e., optimal or suboptimal) indicates the number of hospital
groups in that partition; the precise definition of these groups is given under
the "group structure" heading. For example, in the EnU variable set, the first
112 hospitals in that composite dendrogram are grouped together, the next 9
hospitals form the second group, etc. in the optimal partition. It is
interesting to note that the percentage of hospitals correctly classified by a
linear discriminant analysis closely corresponds to the values of the expected
distinctiveness measure. From these tables, it would appear that the partitions
can be well discriminated by linear functions, and that the values of expected
distinctiveness reflect this ability.

Figure 4.2     Composite Dendrogram:     Exogenous Approach, Regression Weights

Figure 4.3     Composite Dendrogram:     HCFA Approach, Regression Weights

Composite Dendrogram:     Endogenous, Unit Weights

Further partition validation was provided by several one-way analyses of
variance (Table 4.9) using the identified groups as independent variables, and
the weighted factors used to create the original similarity measures as
dependent variables. In the case of unit weights, all factors were significant
at the .05 level; in fact, all factors except one case mix factor (factor 5:
weight=0.25) were significant at the .01 level. When using regression weights,
however, the results were markedly different. In this case factors 3 and
4 in the endogenous variable set are not significant in determining hospital
groups; examination of Table 4.5 reveals that the weights on both these factors
were zero since they were not statistically significant in the regression
equation from which the weights were drawn.

III.     Cluster Analysis Results:     1070 Hospital Sample

Three variable sets were used in the analysis of the full sample: EnR, EnU, and
ExR. Due to cost limitations, it was necessary to restrict the number of
algorithms used for each variable set. Therefore, a composite dendrogram based
on three representative algorithms (instead of the six used to analyze the 194
hospital subsample) was produced for the endogenous variable set with regression
weights, and a single algorithm was used for the other two variable sets.

The selection of algorithms for the EnR approach was determined by the size of
the absolute cophenetic correlation coefficients between clustering algorithms
and their respective similarity matrices (Table 4.6), and by type of algorithm.
An attempt was made to group the algorithms themselves on the basis of
mathematical similarity, and the algorithm with the highest cophenetic
correlation coefficient was then chosen for each group. This admittedly inexact
selection process resulted in the use of the complete linkage, average linkage
between groups, and centroid methods. On the basis of their performance in
analyzing the 194 hospital subsample, it was felt that these algorithms were
most likely to result in a composite dendrogram most similar to the composite
dendrogram which would have been determined if all six algorithms had been used.

For  the  EnU  and  ExR  approaches,   the  average  linkage  between  groups  algorithm  was 
selected  since  it  had  the  highest  absolute  cophenetic  correlation  coefficient 
among  all  approaches  for  the  hospital  subsample  (Table  4.6). 


The  individual  factors  and  their  respective  unit  weights  are  identified 
in  Tables  4.2,  4.3,  and  4.4;  the  regression  weights  used  for  each  factor  are 
given  in  Table  4.5.     The  last  factors  for  the  endogenous  and  exogenous  variable 
sets  in  Table  4.9  (i.e.,  factors  7  and  5,  respectively)  represent  the  uniform 
pressure  occupancy  index. 

The determination and examination of suboptimal partitions proceeded in the
same manner used to analyze the smaller subsample as described in the third
chapter. Examination of suboptimal partitions was discontinued when it became
evident that further breakdown would result in a very large decrease in the
overall expected distinctiveness value, E(D), and the atomization of the
partition into many small groups. Tables 4.10, 4.11, and 4.12 provide
information on the configuration of the optimal and suboptimal partitions found
for the three approaches examined. Descriptions of the groups found in this
analysis are given in Appendix A.

Cluster validation proceeded in a fashion similar to the approach used for
the smaller subsample. Linear discriminant analysis was initially performed;
the results are presented in Table 4.13. As with the subsample results, there
appears to exist a close relationship between the value of expected
distinctiveness and the percentage of hospitals correctly classified by the
linear discriminant functions. Moreover, the percentage of hospitals correctly
classified is strikingly high, which again strongly suggests that the hospital
groups are fairly well-defined, compact, and separable by linear functions. It
also suggests that once a sample of hospitals has been clustered and used to
determine the discriminant functions, all hospitals might be classified by
these discriminant functions with a fairly high degree of accuracy.

The  analysis  of  variance  results  (Tables  4.14,  4.15,  and  4.16)  are  informative 
in  several  ways.  First,  all  factors  in  every  group  structure  have  highly 
significant  F  ratios  (significant  at  a  .001  level),   indicating  that  every 
group  structure  successfully  discriminates  on  the  factors  used  to  create  the 
structures. 

It is also interesting to examine the absolute values of the F ratios, which
give the ratio of the between group mean square to the within group mean
square (i.e., a high F ratio indicates that most of a measure's variation
occurs between groups rather than within groups). The results in Tables 4.14,
4.15 and 4.16 indicate that a strong relationship exists between the weights
attached to each factor and their respective F values. In almost all cases,
those factors with the highest weights also have the highest F ratios; for
example, in the exogenous variable set with regression weights, factor 1 has
the largest weight (.375) and greatest associated F ratio for all partitions.
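
For reference, the one-way F ratio tabulated in these tables can be computed as
in the sketch below, which forms the ratio of the between group mean square to
the within group mean square (the quantities listed under MEAN SQ). The data
are hypothetical and the function name is illustrative.

```python
def one_way_f_ratio(groups):
    """Return (F ratio, between-group d.f., within-group d.f.) for a one-way layout."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

# Hypothetical factor scores for two small hospital groups.
f_ratio, df_b, df_w = one_way_f_ratio([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

With 1070 hospitals in 13 groups the same degrees-of-freedom rule gives 12 and
1057, the values printed in Table 4.14.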

In order to test the relationship between the factor weights and F ratios
further, the simple product-moment correlation coefficient was calculated between
these two vectors for all three approaches. The analysis of all three variable
sets resulted in values of 0.77 or higher. In addition, given the small
differences between factor weights and the significant differences between F
values in some cases, it would also appear from this analysis that resultant
clusters are fairly sensitive to changes in the similarity measures in general
and the weights in particular.
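
The calculation can be illustrated with the unit weights and F ratios printed
for the optimal group structure in Table 4.15. The report does not say which
partitions entered its own calculation, so the choice of the optimal structure
alone is an assumption made for this sketch.

```python
import math

def pearson(x, y):
    """Simple product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Factors 1 to 7, EnU optimal group structure (Table 4.15).
unit_wts = [1.00, 0.25, 1.00, 0.25, 0.25, 0.25, 1.00]
f_ratios = [67.06, 18.59, 120.28, 15.78, 48.65, 10.10, 169.15]
r_weights_f = pearson(unit_wts, f_ratios)
print(round(r_weights_f, 2))
```

For these seven factors the coefficient is roughly 0.85, in line with the
statement that all three variable sets gave values of 0.77 or higher.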

A.     Interpretation/Evaluation  of  Group  Structures 

The  optimal  and  suboptimal  (in  terms  of  the  expected  distinctiveness)  partitions 
found  for  the  three  variable  sets  (i.e.,  EnR,  EnU,  and  ExR)  were  compared  using 
the  A  measure  developed  in  the  third  chapter.     Basically,   the  A  measure  is  a 

TABLE 4.10

Partitions: Endogenous Approach - Regression Weights (N = 1070 Hospitals)

Optimal               Subopt 1                   Subopt 2
(E(D) = 4887)         (E(D) = 3947)              (E(D) = 3771)
                      (splits #1 of optimal)     (splits #101 of subopt 1)

Group   Size          Group   Size               Group    Size
   1     737           101     312               10101      14
   2     174           102      32               10102      50
   3      53           103       1               10103       7
   4       1           104       1               10104      12
   5       1           105      30               10105      25
   6      20           106      48               10106      30
   7       3           107       4               10107      74
   8       1           108      17               10108      19
   9       3           109      13               10109      81
  10       3           110     100
  11      69           111      32
  12       4           112      14
  13       1           113      12
                       114     122
                       115       8

13 Groups Total       27 Groups Total            35 Groups Total

TABLE 4.11

Partitions: Endogenous Approach - Unit Weights (N = 1070 Hospitals)

Subopt  3  (E(D)  =  2748) 
(splits  #1  of  optimal) 


Group 

* 

Size 

101 

400 

Optimal (E(D) = 4888)

102 

6 

103 

13 

Group 

Size 

104 

7 

105 

2 

1 

788   ►  « 

106 

216 

107 

36 

108 

33 

2 

42 

109 

25 

110 

35 

,  111 

15 

3 

1 

26  Groups 

Total 

4 

12 

5 

1 

10 


216 — ► 


Subopt  1  (E(D)  =  4479) 
(splits  #9  of  optimal) 
Group  Size 


901 
902 
903 


14  Groups  Total 


Subopt 2 (E(D) = 4327)
(splits  #101  of  subopt  1) 
Group  Size 


[90101 
214  ►  { 90102 

1 


[90103 

16  Groups  Total 


113 
62 
39 


11 


12  1 
12  Groups  Total 

TABLE 4.12

Partitions: Exogenous Approach - Regression Weights (N = 1070 Hospitals)

Optimal (E(D) = 9700)                    10 Groups Total

   Group   Size
     1      783
     2      219
     3      [size not legible]
     4       44
     5        1
     6       14
     7        4
     8        1
     9        1
    10        2

Subopt 1 (E(D) = 9480)                   13 Groups Total
(splits #4 of optimal)

   Group   Size
    401      19
    402      15
    403       5
    404       5

Subopt 2 (E(D) = 8873)                   15 Groups Total
(splits #2 of optimal)

   Group   Size
    201      13
    202      24
    203     182

Subopt 3 (E(D) = 2614)                   17 Groups Total
(splits #1 of optimal)

   Group                    Size
   [labels not legible]      478
                             304
                               1

TABLE 4.13

Linear Discriminant Analysis Results (N = 1070 Hospitals)

                                                              % Correctly
Variable Set   Partition        No. of Groups      E(D)*      Classified

EnR            Optimal               13             4887         89.9
               Suboptimal - 1        27             3947         78.2
               Suboptimal - 2        35             3771         77.8

EnU            Optimal               12             4888         94.4
               Suboptimal - 1        14             4479         94.6
               Suboptimal - 2        16             4327         93.0
               Suboptimal - 3        26             2748         88.0

ExR            Optimal               10             9700         95.5
               Suboptimal - 1        13             9480         95.7
               Suboptimal - 2        15             8873         94.8
               Suboptimal - 3        17             2614         94.4

*Expected distinctiveness

TABLE 4.14

Analysis of Variance: Endogenous Approach - Regression Weights
(N = 1070 Hospitals)

                                  BETWEEN GROUP     WITHIN GROUP
OPTIMAL GROUP STRUCTURE*          D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO       F PROB
(13 Groups)

Factor 1 (Prices:   wt=.267)       12    31.77      1057    .43         74.27        .000
Factor 2 (Case Mix: wt=.251)       12    25.34      1057    .53         47.78        .000
Factor 3 (Severity: wt=.039)       12    13.20      1057    .37         35.40        .000
Factor 4 (Case Mix: wt=.003)       12     4.15      1057    .46     [not legible]    .000
Factor 5 (Case Mix: wt=.446)       12    26.59      1057    .22     [not legible]    .000
Factor 6 (Case Mix: wt=.066)       12     1.91      1057    .24     [not legible]    .000
Factor 7 (UPOI:     wt=.145)       12    59.48      1057    .34        177.05        .000

                                  BETWEEN GROUP     WITHIN GROUP
1st SUBOPTIMAL STRUCTURE**        D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO       F PROB
(27 Groups)

Factor 1 (Prices:   wt=.267)       26    23.21      1043    .22        105.33        .000
Factor 2 (Case Mix: wt=.251)       26    24.70      1043    .21        115.84        .000
Factor 3 (Severity: wt=.039)       26     7.31      1043    .35         21.03        .000
Factor 4 (Case Mix: wt=.003)       26     4.72      1043    .39         12.06        .000
Factor 5 (Case Mix: wt=.446)       26    16.52      1043    .11        147.40        .000
Factor 6 (Case Mix: wt=.066)       26     1.41      1043    .23          6.04        .000
Factor 7 (UPOI:     wt=.145)       26    30.57      1043    .26        116.30        .000

                                  BETWEEN GROUP     WITHIN GROUP
2nd SUBOPTIMAL STRUCTURE***       D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO       F PROB
(35 Groups)

Factor 1 (Prices:   wt=.267)       34    19.75      1035    .16        126.26        .000
Factor 2 (Case Mix: wt=.251)       34    20.92      1035    .15        141.09        .000
Factor 3 (Severity: wt=.039)       34     5.74      1035    .35         16.62        .000
Factor 4 (Case Mix: wt=.003)       34     3.94      1035    .38         10.28        .000
Factor 5 (Case Mix: wt=.446)       34    13.27      1035    .09        144.34        .000
Factor 6 (Case Mix: wt=.066)       34     1.67      1035    .23          5.03        .000
Factor 7 (UPOI:     wt=.145)       34    24.66      1035    .22        110.66        .000

*     Highest value of Expected Distinctiveness
**    Second highest value of Expected Distinctiveness
***   Third highest value of Expected Distinctiveness

TABLE 4.15

Analysis of Variance: Endogenous Approach - Unit Weights (N = 1070 Hospitals)

                                  BETWEEN GROUP     WITHIN GROUP
OPTIMAL GROUP STRUCTURE           D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO   F PROB
(12 Groups)

Factor 1 (Prices:   wt=1.00)       11    31.12      1058    .46        67.06     .000
Factor 2 (Case Mix: wt= .25)       11    12.73      1058    .68        18.59     .000
Factor 3 (Severity: wt=1.00)       11    27.90      1058    .23       120.28     .000
Factor 4 (Case Mix: wt= .25)       11     6.80      1058    .43        15.78     .000
Factor 5 (Case Mix: wt= .25)       11    16.69      1058    .34        48.65     .000
Factor 6 (Case Mix: wt= .25)       11     2.42      1058    .24        10.10     .000
Factor 7 (UPOI:     wt=1.00)       11    61.95      1058    .37       169.15     .000

                                  BETWEEN GROUP     WITHIN GROUP
1st SUBOPTIMAL STRUCTURE          D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO   F PROB
(14 Groups)

Factor 1 (Prices:   wt=1.00)       13    26.50      1056    .46        57.24     .000
Factor 2 (Case Mix: wt= .25)       13    11.25      1056    .68        16.50     .000
Factor 3 (Severity: wt=1.00)       13    23.74      1056    .23       102.85     .000
Factor 4 (Case Mix: wt= .25)       13     5.80      1056    .43        13.45     .000
Factor 5 (Case Mix: wt= .25)       13    14.26      1056    .34        41.69     .000
Factor 6 (Case Mix: wt= .25)       13     2.68      1056    .23        11.56     .000
Factor 7 (UPOI:     wt=1.00)       13    52.47      1056    .37       143.24     .000

                                  BETWEEN GROUP     WITHIN GROUP
2nd SUBOPTIMAL STRUCTURE          D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO   F PROB
(16 Groups)

Factor 1 (Prices:   wt=1.00)       15    23.38      1054    .46        51.05     .000
Factor 2 (Case Mix: wt= .25)       15    10.21      1054    .68        15.12     .000
Factor 3 (Severity: wt=1.00)       15    23.30      1054    .19       121.05     .000
Factor 4 (Case Mix: wt= .25)       15     5.81      1054    .42        13.79     .000
Factor 5 (Case Mix: wt= .25)       15    12.54      1054    .34        36.86     .000
Factor 6 (Case Mix: wt= .25)       15     2.49      1054    .23        10.82     .000
Factor 7 (UPOI:     wt=1.00)       15    51.67      1054    .28       185.27     .000

                                  BETWEEN GROUP     WITHIN GROUP
3rd SUBOPTIMAL STRUCTURE          D.F.  MEAN SQ     D.F.  MEAN SQ     F RATIO   F PROB
(26 Groups)

Factor 1 (Prices:   wt=1.00)       25    24.78      1044    .29       120.95     .000
Factor 2 (Case Mix: wt= .25)       25    14.78      1044    .47        31.16     .000
Factor 3 (Severity: wt=1.00)       25    14.87      1044    .17        85.88     .000
Factor 4 (Case Mix: wt= .25)       25     8.98      1044    .29        30.61     .000
Factor 5 (Case Mix: wt= .25)       25     8.16      1044    .33        24.86     .000
Factor 6 (Case Mix: wt= .25)       25     3.40      1044    .19        18.25     .000
Factor 7 (UPOI:     wt=1.00)       25    27.80      1044    .12       318.14     .000
TABLE  4.16

Analysis  of  Variance:  Exogenous  Approach  -  Regression  Weights  (N  =  1070  Hospitals)

OPTIMAL GROUP STRUCTURE             BETWEEN GROUP       WITHIN GROUP
(10 Groups)                        D.F.   MEAN SQ     D.F.   MEAN SQ    F RATIO   F PROB

Factor 1 (Case Mix: wt=.375)          9    100.66     1060       .14     701.05     .000
Factor 2 (Prices: wt=.330)            9     57.13     1060       .43     133.19     .000
Factor 3 (Case Mix: wt=.035)          9      8.63     1060       .77      11.20     .000
Factor 4 (Case Mix: wt=.054)          9      7.61     1060       .47      16.18     .000
Factor 5 (UPOI: wt=.115)              9     67.41     1060       .44     154.57     .000

1st SUBOPTIMAL STRUCTURE            BETWEEN GROUP       WITHIN GROUP
(13 Groups)                        D.F.   MEAN SQ     D.F.   MEAN SQ    F RATIO   F PROB

Factor 1 (Case Mix: wt=.375)         12     76.90     1057       .13     600.49     .000
Factor 2 (Prices: wt=.330)           12     43.93     1057       .42     105.16     .000
Factor 3 (Case Mix: wt=.035)         12      8.45     1057       .75      11.27     .000
Factor 4 (Case Mix: wt=.054)         12      7.07     1057       .46      15.50     .000
Factor 5 (UPOI: wt=.115)             12     50.67     1057       .44     116.18     .000

2nd SUBOPTIMAL STRUCTURE            BETWEEN GROUP       WITHIN GROUP
(15 Groups)                        D.F.   MEAN SQ     D.F.   MEAN SQ    F RATIO   F PROB

Factor 1 (Case Mix: wt=.375)         14     66.02     1055       .13     520.43     .000
Factor 2 (Prices: wt=.330)           14     39.71     1055       .39     101.48     .000
Factor 3 (Case Mix: wt=.035)         14     13.72     1055       .66      20.61     .000
Factor 4 (Case Mix: wt=.054)         14      6.21     1055       .46      13.65     .000
Factor 5 (UPOI: wt=.115)             14     46.26     1055       .40     115.81     .000

3rd SUBOPTIMAL STRUCTURE            BETWEEN GROUP       WITHIN GROUP
(17 Groups)                        D.F.   MEAN SQ     D.F.   MEAN SQ    F RATIO   F PROB

Factor 1 (Case Mix: wt=.375)         16     57.88     1053       .12     461.18     .000
Factor 2 (Prices: wt=.330)           16     50.06     1053       .16     313.94     .000
Factor 3 (Case Mix: wt=.035)         16     13.89     1053       .64      21.76     .000
Factor 4 (Case Mix: wt=.054)         16      7.29     1053       .43      17.05     .000
Factor 5 (UPOI: wt=.115)             16     46.60     1053       .31     151.75     .000
weighted matching coefficient using all pairwise hospital combinations as
observations, where the weights are determined under the hypothesis that
grouping occurs at random. It was shown in the third chapter that random
grouping results in an expected value of A equal to 0.5. Due to the large
number of hospitals in this sample (i.e., 1070 hospitals) and the even larger
number of pairwise combinations (i.e., 571,915 possible pairs), a 25 percent
random sample of pairwise combinations was used in calculating A. Tests showed
that little, if any, accuracy was lost by this approach.
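
The pair count and the sampling step can be illustrated with a short sketch. It is only an illustration: the weighted matching coefficient A is defined in the third chapter and is not reproduced here, so the code substitutes a plain unweighted matching proportion (the share of sampled pairs that both partitions treat alike). The function name, the seed, and the toy labels are assumptions made for the example.

```python
import random
from math import comb

def sampled_pair_agreement(labels_a, labels_b, fraction=0.25, seed=0):
    """Share of sampled pairs on which two partitions agree.

    A pair counts as a match when both partitions place the two
    entities together, or both place them apart. This is an
    unweighted stand-in, not the report's weighted statistic A.
    """
    n = len(labels_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng = random.Random(seed)
    k = max(1, round(fraction * len(pairs)))
    sample = rng.sample(pairs, k)
    matches = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in sample
    )
    return matches / len(sample)

# Number of distinct pairs among 1070 hospitals, as quoted in the text.
n_pairs = comb(1070, 2)

# Toy example: two labelings of the same grouping agree on every pair.
same = sampled_pair_agreement([0, 0, 1, 1, 2, 2], [5, 5, 7, 7, 9, 9])
```

With `fraction=1.0` the function uses every pair, which gives a simple way to check how much a 25 percent sample changes the result on a small data set.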

The first set of comparisons examined the differences between partitions found
with the EnR and the EnU approaches (i.e., the variables remained the same but
the weights varied). The optimal partitions in both approaches contained 12
groups; the A measure calculated between these partitions showed them to be
highly similar (A = 0.81). As is evident from Table 4.17, the most distinctive
partition for the EnR approach is highly similar to the most distinctive, first
suboptimal, and second suboptimal partitions found for the EnU approach. The
comparison of the most distinctive EnR partition with the third suboptimal EnU
partition, however, shows little similarity; with A equal to 0.462, the null
hypothesis that grouping has occurred randomly cannot be rejected. Clues to the
cause of this change in A are provided by the analysis of variance results.
Using the F ratio as an indicator, the maximally distinctive EnR partition
appeared to be heavily influenced by the UPOI measure (an urban-rural
indicator) and case mix. The maximally distinctive, first suboptimal, and
second suboptimal EnU partitions are also heavily influenced by the UPOI
measure as well as the case mix severity factor. However, in the third
suboptimal EnU partition, the influence of the case mix severity factor appears
to decline. This partition apparently is determined mostly by the urban-rural
indicator and input prices.

With one exception, the remaining comparisons were significantly below 0.500,
supporting the conjecture that once the EnR partitions begin to represent a
more complex weight scheme, the two partition structures no longer correspond.
Comparisons between the EnR and ExR approaches (Table 4.18), and the EnU and
ExR approaches (Table 4.19) show similar patterns; the most distinctive
partitions appear to be significantly related, but as the number of groups
increases in the suboptimal partitions, the partition comparison measure
decreases.

While some explanation may be gathered by examining the factors and the
analyses of variance, this phenomenon (i.e., a lower partition comparison
measure between partitions with larger numbers of groups) may be explained by
examining the nature of the problem in general. The comparison between the
optimal ExR partition (10 groups) and the third suboptimal EnU partition
(26 groups) in Table 4.19 provides an example. The low A measure (0.368)
indicates dissimilarity, yet it may be possible to combine the groups in the
EnU partition into 10 groups in such a way that the resultant partition would
be identical to the optimal ExR partition. In other words, while these two
partitions appear to be very dissimilar, it may be that the partitions are
quite consistent in the sense that one partition maps onto the other. This
problem of finding the partition mapping which maximizes A will be studied in
future work.
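
The idea of one partition mapping onto another can be made concrete with a small sketch. It does not solve the problem of maximizing A, which the text leaves to future work; it only assigns each group of the finer partition to the coarser group that holds most of its members and reports how much of the sample is consistent with that mapping. The function name and the toy data are assumptions.

```python
from collections import Counter, defaultdict

def best_mapping(fine, coarse):
    """Map each fine group to the coarse group holding most of its members.

    Returns the mapping and the share of entities that land in the
    coarse group their fine group was mapped to (1.0 means the fine
    partition nests perfectly inside the coarse one).
    """
    members = defaultdict(Counter)
    for f, c in zip(fine, coarse):
        members[f][c] += 1
    mapping = {f: counts.most_common(1)[0][0] for f, counts in members.items()}
    kept = sum(counts[mapping[f]] for f, counts in members.items())
    return mapping, kept / len(fine)

# Toy data: four fine groups that collapse exactly onto two coarse groups.
fine = ["a", "a", "b", "b", "c", "d"]
coarse = [1, 1, 1, 1, 2, 2]
mapping, share = best_mapping(fine, coarse)
```

A share of 1.0 means the two partitions are fully consistent even though a pairwise comparison measure between them could still be low.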
TABLE  4.17

                                              EnR  Approach
                              Optimal         1st Suboptimal  2nd Suboptimal
EnU  Approach                 (12 groups)     (27 groups)     (35 groups)

Optimal (12 groups)           0.811           0.286           0.170
1st Suboptimal (14 groups)    0.811           0.278           0.160
2nd Suboptimal (16 groups)    0.844           0.281           0.161
3rd Suboptimal (26 groups)    0.462           0.611           0.334

A  Partition  Comparison  Measure
TABLE  4.18

Partition  Comparison:  EnR  Versus  ExR*

                                              EnR  Approach
                              Optimal         1st Suboptimal  2nd Suboptimal
ExR  Approach                 (13 groups)     (27 groups)     (35 groups)

Optimal (10 groups)           0.662           0.294           0.197
1st Suboptimal (13 groups)    0.666           0.279           0.174
2nd Suboptimal (15 groups)    0.655           0.261           0.163
3rd Suboptimal (17 groups)    0.462           0.321           0.269

A  Partition  Comparison  Measure
TABLE  4.19

Partition  Comparison:  EnU  Versus  ExR*

                                              EnU  Approach
                          Optimal       1st Subopt    2nd Subopt    3rd Subopt
ExR  Approach             (12 groups)   (14 groups)   (16 groups)   (26 groups)

Optimal (10 groups)       0.742         0.740         0.720         0.368
1st Subopt (13 groups)    0.745         0.735         0.719         0.357
2nd Subopt (15 groups)    0.723         0.732         0.712         0.355
3rd Subopt (17 groups)    0.486         0.484         0.475         0.489

A  Partition  Comparison  Measure
CHAPTER  FIVE

SUMMARY  AND  CONCLUSIONS 

The  purpose  of  this  study  has  been  to  provide  an  initial  investigation  into 
the  feasibility  of  a  hospital  classification  methodology  that  could  be  used 
as  part  of  a  prospective  payment  system  in  the  U.  S.  hospital  industry.  The 
aim  of  this  chapter  is  to  briefly  summarize  the  major  conclusions  suggested 
by  our  work,  to  reiterate  the  qualifications  that  must  accompany  these
conclusions,  and  to  suggest  directions  for  further  investigation.     It  is  important
to  note  that  the  primary  focus  of  this  initial  effort  was  on  the  development  of 
the  conceptual  and  statistical  methodology;  while  we  have  reported  the  results 
of  an  empirical  analysis  that  is  based  on  our  proposed  methodology,  severe 
data  limitations  restrict  the  usefulness  of  the  empirical  results  to  answering 
particular  questions  about  the  characteristics  of  the  system  and  suggesting 
potential  problem  areas  for  further  study.     The  final  groupings  reported  in 
Chapter  4,  therefore,  must  be  regarded  as  illustrative.     They  are  not  intended 
to  reflect  our  view  of  the  most  appropriate  industry  classification  for  rate 
determination  or  rate  review  purposes. 

I.     Conceptual  Framework

Perhaps  the  most  important  conclusion  that  can  be  drawn  from  the  conceptual 
development  in  Chapter  2  is  that  the  selection  of  the  criteria  on  which  hospital 
similarity  is  to  be  based  is  exceedingly  important.    While  most  of  the  attention 
in  the  literature  has  focused  on  the  selection  of  the  statistical  technique 
to  be  used  (e.g.,  factor  analysis,  cross  classification,  cluster  analysis,  etc.),
the  use  of  inappropriate  classification  variables  will  lead  to  problems  and
biases  as  severe  as  those  resulting  from  a  poor  choice  of  statistical
technique. 

Since  prospectively  determined  reimbursement  guidelines  (or  limits)  function 
as  surrogate  prices  in  the  market  for  hospital  services,  any  conclusions  regarding
the  appropriate  choice  of  criteria  draw  heavily  on  economic  price  theory.
It  was  concluded  that  three  types  of  variables  can  be  identified:     those  that
are  outside  the  control  of  the  hospital  (e.g.,  input  prices),  those  that  are
within  the  control  of  the  hospital  (e.g.,  length  of  stay),  and  those  variables
that  are  of  specific  policy  interest  (e.g.,  the  extent  of  teaching  programs,
the  quality  of  care).     In  order  to  assure  the  long  run  financial  health  of
efficient  institutions  in  the  industry,  variables  of  the  first  type  must  be 
included  in  the  set  of  classification  criteria.     Inclusion  of  variables  of  the 
second  type  (either  directly  or  as  surrogates  for  the  type  1  variables)  can  be 
expected  to  promote  changes  in  institutional  behavior  contrary  to  the  cost 
containment  objectives  of  the  system. 

With respect to the policy variables, the inclusion or exclusion decision must
be made on the basis of the desires of the policy makers and the other policy
instruments at hand. Thus, a program intended to encourage teaching programs
should include a teaching variable in the similarity criteria so that an
institution would never be penalized for the additional cost of efficiently run
teaching programs. A program uninterested in the promotion of above average
quality of care would exclude an explicit quality variable, and rely on
licensure standards or other constraints to monitor the average.

Further,  the  specific  set  of  classification  variables  that  is  appropriate
depends  not  only  on  program  policy  but  also  on  other  elements  of  program
design.     The  question  of  which  hospitals  should  be  grouped  together  cannot 
be  answered  without  first  answering  the  questions:    What  is  the  proposed 
payment  or  review  unit  (e.g.,  per  service,  per  day,  per  admission,  per  unit 
of  time)?    What  costs  are  the  targets  of  the  control  program  (e.g.,  routine 
costs,  total  costs)?    How  is  the  rate  or  limit  to  be  determined  within  groups? 

With  respect  to  variable  weighting,  it  was  concluded  that  the  appropriate 
theoretical  answer  is  of  little  practical  use:    variables  that  affect  cost 
should  be  weighted  in  the  classification  system  according  to  their  coefficients 
in  the  "efficient"  industry  cost  function.     Given  that  the  parameters  of  this 
function  are  not  known  (and  can  be  estimated  only  with  data  on  existing  industry 
practice),  this  result  only  provides  guidance  for  a  very  imperfect  second  best 
approach. 

Finally,   the  issue  of  system  validation  was  discussed.     Because  the  methodology 
proposed  in  this  study  attempts  to  classify  hospitals  based  on  efficient 
performance  rather  than  actual  performance,  measures  of  statistical  tidiness 
of  the  resulting  hospital  groups  will  be  good  validation  tools  of  this  system 
only  if  the  industry  is  currently  operating  close  to  the  optimal  position. 


II.     Statistical  Methodology 

Chapter  3  began  with  a  discussion  of  the  selection  of  the  grouping  methodology. 
Cluster  analysis  was  selected  as  the  appropriate  grouping  methodology  on  two 
counts.     First,  unlike  AID,  discriminant  analysis  or  regression  analysis, 
cluster  analysis  does  not  require  the  specification  of  a  dependent  variable. 
Further,  unlike  regression  analysis,  factor  analysis  or  discriminant  analysis, 
this  technique  does  not  impose  a  particular  functional  form  on  the  relationship 
among  the  variables. 

A  major  conclusion  of  this  section  of  the  study  was  that  the  preparation  of 
the  input  variables  for  the  cluster  analysis  must  proceed  with  care.  While 
the  discussion  in  Chapter  2  could  offer  only  weak  guidelines  for  the  assignment 
of  appropriate  weights  to  the  input  variables,  it  was  observed  in  Chapter  3 
that  implicit  (and  unknown)  weights  may  be  assigned  unintentionally  if  the 
variables  are  multicollinear.     The  suggested  method  of  removing  these  implicit
weights  was  to  factor  analyze  the  input  variables  and  use  only  the  standardized 
factor  scores  in  the  computation  of  the  similarity  matrix.     In  this  way,  explicit 
(and  known)  weights  can  be  added,  and  the  sensitivity  of  the  resulting  group 
structure  to  changes  in  these  weights  can  be  tested. 
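
The last step of this preparation, applying explicit weights to standardized scores when computing the similarity matrix, can be sketched as follows. The factor analysis itself is omitted, and the code assumes that a weight multiplies the squared difference on its factor; the report does not state the exact form here, so that choice, the function names, and the toy scores are all assumptions.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column of scores to mean 0 and unit standard deviation."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

def weighted_sq_distances(scores, weights):
    """Lower-triangle matrix of weighted squared Euclidean distances.

    scores  : one row per hospital, one column per factor
    weights : one explicit weight per factor (assumed to scale the
              squared difference on that factor)
    Row i of the result holds the distances to entities 0..i-1.
    """
    columns = [standardize(list(col)) for col in zip(*scores)]
    rows = list(zip(*columns))
    return [
        [sum(w * (a - b) ** 2 for w, a, b in zip(weights, rows[i], rows[j]))
         for j in range(i)]
        for i in range(len(rows))
    ]

# Three hypothetical hospitals scored on two factors.
scores = [[1.0, 10.0], [2.0, 10.0], [3.0, 40.0]]
unit = weighted_sq_distances(scores, [1.0, 1.0])
first_only = weighted_sq_distances(scores, [1.0, 0.0])
```

Rerunning the computation with a different weight vector, as in `first_only`, is the kind of sensitivity test the paragraph describes: a factor given zero weight drops out of every distance.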

Once it had been concluded that agglomerative hierarchical clustering
algorithms avoid the computational problems of divisive algorithms and provide
more information than iterative algorithms, several well-known agglomerative
algorithms were discussed. Since it is known that various algorithms tend to
search for clusters of different shapes (e.g., the complete linkage algorithm
tends to find clusters of spherical shape while the single linkage algorithm
tends to find serpentine clusters), a conservative method of combining the
results from a number of individual algorithms was developed. This composite
approach also offered the advantage that no single algorithm had to be selected
and justified, a tenuous task in most cases, as empirically demonstrated in the
fourth chapter.

A  method  was  also  developed  for  analyzing  the  composite  dendrogram  resulting 
from  this  combined  approach.     It  was  shown  how  a  natural  measure  called  expected 
distinctiveness  could  be  used  for  comparing  partitions  from  the  composite 
dendrogram.     An  easily  implemented  algorithm  based  on  expected  distinctiveness 
was  explained  and  illustrated.     This  algorithm  can  be  used  in  choosing  among 
the  various  partitions  found. 

Finally,  in  Chapter  Three,  a  statistic  for  measuring  the  similarity  between 
group  structures  was  developed.     This  statistic,  based  on  the  null  hypothesis 
of  random  grouping  and  equally  likely  groups,  was  useful  for  statistically 
testing  the  sensitivity  of  the  classification  system  to  changes  in  various 
parameters  and  procedures. 


III.     Empirical  Results 

It  was  stated  at  the  beginning  of  this  chapter  that  the  data  set  available  in 
this  study  precluded  a  definitive  test  of  the  characteristics  of  the  classification 
system  proposed  here.     However,  a  number  of  empirical  results  were  obtained  that 
have  interesting  implications  for  further  conceptual  development,  data  collection, 
and  empirical  analysis. 

A  brief  summary  of  the  empirical  approach  is  presented  here,  followed  by  a 
discussion  of  the  most  useful  empirical  results. 

The  first  step  in  the  empirical  analysis  was  to  identify  measures  within  the 
available  data  set  that  most  closely  captured  the  spirit  of  the  grouping  variables 
selected  in  Chapter  2.     The  most  severe  problem  was  with  the  case  mix  variable. 
The  solution  finally  adopted  was  to  choose  two  alternative  approaches  (the 
"endogenous"  approach  and  the  "exogenous"  approach),  both  of  which  violated  the 
principles  established  in  Chapter  2.     Unfortunately,  in  the  absence  of  more  direct
measures,  no  estimate  of  the  bias  introduced  by  these  approaches  was  possible.

The input measures from these two approaches plus a third (representing the
essence of the current HCFA approach) were factor analyzed. The factor scores
were then assigned weights. Once again, the absence of an optimal approach
resulted in the choice of two alternatives: the assignment of unit weights to
each variable, and the assignment of regression weights. In the latter case,
the weights represented the coefficients of each factor in a regression with
cost per case as the dependent variable. As noted above, this provides an
empirical estimate of the cost function.


For  variables  represented  by  more  than  one  factor,  each  factor  was 
assigned  an  equal  proportion  of  the  unit  weight. 
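
The equal-proportion rule can be checked against the weights printed in Table 4.15, where the four case mix factors each carry a weight of .25. The sketch below is a minimal illustration; the function name is an assumption.

```python
from collections import Counter

def split_unit_weights(factor_variables):
    """Give each variable a total weight of 1.0, shared equally by its factors.

    factor_variables lists, factor by factor, the variable that each
    factor represents.
    """
    counts = Counter(factor_variables)
    return [1.0 / counts[v] for v in factor_variables]

# Factor labels as they appear in Table 4.15 (Factors 1 through 7).
labels = ["Prices", "Case Mix", "Severity", "Case Mix",
          "Case Mix", "Case Mix", "UPOI"]
weights = split_unit_weights(labels)
```

Here `weights` comes out as 1.0 for Prices, Severity, and UPOI and 0.25 for each of the four case mix factors, matching the table.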
Cluster  analysis  was  then  performed  on  six  sets  of  measures  representing  the
combination  of  three  variable  sets  (endogenous,  exogenous,  and  HCFA)  and  two 
weighting  schemes.     The  analysis  was  run  first  on  a  sample  of  194  hospitals, 
combining  the  results  of  six  algorithms  into  a  composite  dendrogram.     In  the 
analysis  of  the  complete  sample  of  1070  hospitals,  only  three  variable  sets 
were  analyzed:     endogenous  with  regression  weights,  endogenous  with  unit 
weights,  and  exogenous  with  regression  weights.     Three  algorithms  were  combined 
in  the  analysis  of  the  first  variable  set;  only  the  average  linkage  between 
groups  algorithm  was  used  with  the  latter  two.     Finally,  a  number  of  tests  were 
performed  on  the  resulting  group  structures  to  investigate  their  statistical 
characteristics. 

A  number  of  notable  conclusions  were  suggested  by  these  results.     First,  the 
regression  analysis  that  was  performed  to  estimate  factor  score  weights  was 
only  moderately  successful.     The  explanatory  power  of  the  equation  was  never 
as  high  as  50  percent,  and  some  of  the  signs  of  the  coefficients  were  counterintuitive.
It  is  possible  that  this  was  primarily  a  data  problem,  but  further
analysis  is  warranted  here. 

Second,  the  absolute  cophenetic  correlation  coefficients  calculated  for  the 
six  individual  algorithms  indicate  that  no  single  algorithm  consistently  outperforms
any  other.     Thus,  the  use  of  a  composite  dendrogram  appears  to  be
supported. 

In  general,  the  measure  of  expected  distinctiveness  proved  to  be  a  good  indicator
of  partition  "quality."    In  addition,  it  appeared  to  provide  a  fairly
accurate  measure  of  the  ability  of  the  partitions  to  be  discriminated  by  linear 
functions:     the  correlation  between  the  expected  distinctiveness  measure  and  the 
percentage  of  hospitals  correctly  classified  by  a  linear  discriminant  analysis 
was  consistently  high  for  both  the  small  sample  and  the  full  sample  results. 
In  both  cases,  the  hospital  clusters  appear  to  be  well  defined  by  linear  functions. 
The  latter  results  suggest  that  further  analysis  is  warranted  to  investigate 
the  possibility  that  the  entire  hospital  population  might  successfully  be 
classified  based  on  the  discriminant  functions  determined  from  a  sample.  It 
is  imperative  to  note,  however,  that  these  results  may  be  specific  to  this  data 
set  and  may  not  hold  when  improved  variable  measures  and  more  current  data  are 
used. 

Analysis  of  variance  tests  were  performed  to  determine  the  sensitivity  of  the 
final  group  structure  to  changes  in  the  factor  score  weights.     In  the  small 
sample,  the  ANOVA  results  verified  that  factors  assigned  a  zero  weight  in  the 
regression  analysis  (i.e.,  those  factors  whose  coefficients  were  statistically 
insignificant)  had  no  effect  on  final  group  structure  while  factors  with  positive 
weights  all  had  a  significant  impact  on  the  outcome.     This  result  suggests  that 
the  clusters  are  somewhat  sensitive  to  weight  changes,  although  a  more  systematic
analysis  would  have  to  be  performed  to  determine  the  extent  to  which  this
is  true. 

The  small  sample  ANOVA  results  were  supported  by  the  full  sample  analysis  where 
a  strong  positive  relationship  was  found  between  the  weights  attached  to  each 
factor  and  their  respective  F  values. 
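
One way to quantify such a relationship is a rank correlation between the factor weights and their F ratios. The report does not say which measure was used, so the Spearman coefficient below is an assumption; the input values are those of the optimal structure in Table 4.16.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Optimal ExR structure, Table 4.16: factor weights and F ratios
# for Factors 1 through 5.
weights = [0.375, 0.330, 0.035, 0.054, 0.115]
f_ratios = [701.05, 133.19, 11.20, 16.18, 154.57]
rho = spearman(weights, f_ratios)
```

For these five factors the only rank disagreement is between Factor 2 and Factor 5, so `rho` works out to 0.9.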
A partial test of cluster sensitivity to changes in the variables (as opposed
to variable weights) is afforded by the partition comparison measure, A. When
the partitions produced by the endogenous approach and the exogenous approach
(both with regression weights) were compared using the A statistic, a
reasonably high degree of similarity was found, although the degree of
similarity was affected somewhat by the number of groups being compared.

The  partition  comparison  measure  was  also  used  to  provide  a  second  test  of 
sensitivity  to  changes  in  variable  weights.     The  results  of  this  test  were 
similar  to  those  of  the  previous  partition  comparison  tests,  and  thus  inconsistent
with  the  ANOVA  results.     Further  investigation  is  necessary  to  resolve
this  apparent  inconsistency. 

Finally, the group structures suggest the presence of a number of hospitals
that repeatedly fall into groups by themselves. The number of these isolates is
small (approximately two percent of the complete sample), and a detailed
analysis of their characteristics suggested that they indeed represent special
cases (an example is a large teaching institution that is located in a rural
area). To the extent that these hospitals are atypical, it is desirable that
the classification system respond accordingly.

IV.     Directions  for  Further  Research 

The  purpose  of  this  study  was  to  develop  the  conceptual  and  methodological 
underpinnings  for  a  hospital  classification  system  useful  in  a  prospective 
rate  determination  or  review  program.     In  the  process  of  this  development, 
and  in  the  initial  tests  of  its  feasibility,  a  number  of  questions  have 
arisen  that  warrant  further  investigation. 

The  most  obvious  extension  is  the  development  of  better  empirical  measures  of 
the  variables.     Case  mix  and  case  mix  severity  are  at  the  top  of  the  list,  but 
the  measures  for  input  prices  and  rural  markets  used  in  this  study  also  leave 
room  for  improvement.     Once  better  measures  are  obtained  for  a  sample  of 
hospitals,  new  clusters  can  be  generated  and  compared  with  the  present  results. 
These  comparisons  will  then  be  helpful  in  making  decisions  about  the  benefits 
of  larger-scale  data  collection  efforts  for  this  purpose. 

Further  investigation  of  the  sensitivity  of  the  clusters  to  changes  in  variable
weights  is  also  appropriate.     The  tests  reported  here  are  only  suggestive,
since  their  results  appear  to  be  somewhat  inconsistent. 

Cluster  stability  over  time  is  important  for  the  long  term  maintenance  of  a 
payment  or  review  system  based  on  hospital  classification  methods.     Since  the 
data  set  available  to  the  study  contained  only  1973  figures,  questions  of 
cluster  stability  over  time  could  not  be  addressed. 

The  importance  of  a  reliable  measure  of  partition  comparison  in  these  extensions
is  obvious.     While  such  a  measure  was  developed  and  used  in  this  study,
future  work  should  investigate  other  possibilities  and  compare  them  with  the 
study  statistic. 
V.     Concluding  Remarks

It  must  be  remembered  that  the  original  rationale  for  developing  this  hospital 
classification  system  was  as  an  integral  part  of  a  prospective  payment  determination
or  review  program.     By  design,  this  study  has  not  addressed  the
important  issue  of  exactly  how  this  integration  takes  place.     That  is,  how  would 
hospital  clusters  be  used  to  set  or  review  rates?    Upon  what  payment  unit  should 
payment  or  review  be  based?    Which  costs  should  be  included?    All  of  these 
questions  must  be  answered  before  an  effective  program  can  be  implemented. 

The  output  characteristics  of  an  industry,  its  long  run  stability,  and  its 
response  to  technological  change  are  all  determined  by  how  the  industry  is 
financed.     The  institution  of  a  nationwide  prospective  payment  determination 
or  review  system  for  even  a  segment  of  the  hospital  industry  represents  a 
substantial  change  from  current  practice.     If  this  change  is  to  be  positive 
and  to  achieve  the  objectives  implicit  in  its  development,  it  must  be  carefully 
planned  and  carefully  implemented.     It  is  hoped  that  this  study  can  provide  a 
first  step  in  that  direction. 
APPENDIX  C


PROGRAM  SET  FOR  SMALL  SAMPLES 
(For  Sets  With  Less  Than  350  Objects/Variables) 

The small sample program set currently in use completes the clustering
procedure, from distance matrix computations to composite dendrogram, in one
job. Control over the clustering procedure is provided by a file of parameters
that is read at the first step and by alterations to the operating system
control cards. The parameter file contains information on the number of
entities, the number of variables, input format of the data, and optional
weights used in computing distances (either Euclidean or squared Euclidean).
Data input consists of the variables and an optional file of labels identifying
each entity. The control cards are altered to specify such information as the
names of the data files and the clustering methods to be used in finding the
composite dendrogram. The actual programs are kept in compiled form on a file
and are fetched and executed as needed by the procedure. The steps in a typical
run are as follows:

1)  Execute  program  EUCDP,  which  reads  the  input  parameter  file,  prepares 
the  file  for  the  other  job  steps,  and  computes  the  matrix  of  distances 
between  all  pairs  of  entities.     The  distance  matrix,  of  which  only 
the  lower-left-hand  part  is  written,  is  based  on  the  variable  values
and  variable  weights. 

2)  This  step  is  executed  once  for  each  clustering  method,  and  contains 
three  sub-steps: 

a)     Program  RUNCLUS  and  associated  subroutines  perform  the  clustering. 
The  method  used  is  determined  by  the  version  of  subroutine  METHOD 
that  is  loaded.     The  program  writes  a  vector  of  the  joining 
distances  for  each  merge. 



b)  Program  CLUSVEC  converts  the  joining  distance  vector  into  a  full 
lower  left  hand  distance  matrix. 

c)  System  routine  SORTMRG  sorts  the  elements  of  the  joining  distance 
matrix  so  that  the  elements  are  in  the  proper  order. 

3)    Program  CORSTD  uses  the  original  distance  matrix  and  the  joining 
distance  matrix  from  each  method  to  produce  a  composite  joining 
distance  matrix.     Each  pairwise  composite  joining  distance  is  the 
combination  of  a  pair's  output  distances  standardized  and  summed 
across  all  methods,  weighted  by  the  square  of  the  Spearman  correlation 
coefficients  (i.e.,  the  cophenetic  correlation  coefficient)  between 
the  original  distance  matrix  and  method  joining  distance  matrices. 
The  program  calculates  the  cophenetic  correlation  coefficients, 
standardizes  the  matrices,  and  as  an  option  can  write  the  correlation 
and  standardized  matrices  for  further  analysis. 

4)    The  composite  dendrogram  is  produced  by  running  the  complete  linkage
clustering  method  on  the  composite  joining  distance  matrix. 
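
The combination performed in step 3 can be sketched in a few lines. This is not the CORSTD program: it assumes that "standardized" means a z-score over all pairs within a method, takes the joining distances as already computed, and uses illustrative function and method names.

```python
from statistics import mean, pstdev

def rank_with_ties(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result

def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(rank_with_ties(x), rank_with_ties(y))

def composite_joining(original, joining_by_method):
    """Combine per-method joining distances into one composite vector.

    original          : input distances, one value per entity pair
    joining_by_method : {method name: joining distances, same pair order}
    Each method's distances are standardized, weighted by the square of
    its cophenetic (Spearman) correlation, and summed across methods.
    """
    composite = [0.0] * len(original)
    for joining in joining_by_method.values():
        r = spearman(original, joining)
        m, s = mean(joining), pstdev(joining)
        for k, d in enumerate(joining):
            composite[k] += (r ** 2) * (d - m) / s
    return composite

# Three entities, pairs ordered (1,0), (2,0), (2,1).
original = [1.0, 4.0, 3.0]
joining = {"single": [1.0, 3.0, 3.0], "complete": [1.0, 4.0, 4.0]}
composite = composite_joining(original, joining)
```

Weighting by the squared correlation lets methods that reproduce the original distances poorly contribute less to the composite matrix, which is then clustered in step 4.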

The  job  that  compiles  the  programs  (file  1  of  new  program  tape)  compiles 
the  following  main  programs  and  subroutines  and  stores  them  on  a  permanent 
file,  for  use  by  the  clustering  job: 

EUCDP       program for step 1 of cluster run

RUNCLUS  }
CNTRL    }
LFIND    }  main program and common subroutines used for
CLSTR    }  clustering steps 2 and 4.
MTXIN    }
TREE     }

METHOD   single linkage    }
METHOD   complete linkage  }  one of these subroutines is used in each
METHOD   average between   }  execution of steps 2a and 4 to determine
METHOD   average within    }  the clustering method.  Normally we execute
METHOD   median            }  step 2 six times, using the complete linkage
METHOD   centroid sorting  }  through Ward's methods, and use the
METHOD   Ward's            }  complete linkage method in step 4.

CLUSVEC     program for step 2b

CORSTD   }
SORT     }  main program


The job that performs the clustering (file 2 of new program tape) consists of
operating system control cards, the input parameters, and sets of SORT/MERGE
directives for the executions of step 2c.  The job is set up to read the data
and entity labels from permanent (disk) files.  It clusters using six methods,
omitting the single linkage method, and uses the complete linkage method for
the composite cluster.  The composite distance matrix is saved on a permanent
file.  If the job terminates normally, only the composite dendrogram and merge
information are printed; if the job terminates abnormally, the output
(dendrograms) from all methods executed is printed and the joining distance
matrices are saved on permanent files.
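
Step 1 of the run (program EUCDP) writes only the lower-left-hand part of the weighted distance matrix. A minimal Python sketch of that computation follows. It is an illustration under stated assumptions rather than a transcription of EUCDP: the function names are hypothetical, the pair ordering (2,1), (3,1), (3,2), ... and the grouping into records of ten values follow the comments in the EUCDP listing, and the `unsquared` flag stands in for the ISQRT option described there.

```python
import math

def lower_triangle_distances(scores, weights=None, unsquared=False):
    """scores: one row of variable values per entity.
    Returns weighted Euclidean distances for the strict lower triangle,
    row by row: (2,1), (3,1), (3,2), (4,1), ...
    Squared distances are returned unless unsquared is True."""
    nv = len(scores[0])
    if weights is None:
        weights = [1.0] * nv          # unit weights by default
    out = []
    for n in range(1, len(scores)):   # entity n+1 against earlier entities
        for m in range(n):
            d = sum(w * (a - b) ** 2
                    for w, a, b in zip(weights, scores[n], scores[m]))
            out.append(math.sqrt(d) if unsquared else d)
    return out

def records(values, per_record=10):
    """Group the distances into output records of per_record values each."""
    return [values[i:i + per_record] for i in range(0, len(values), per_record)]
```

With negative weights the squared sum can be negative, which is why the listing warns that unsquared distances may then be imaginary; the sketch would raise an error from `math.sqrt` in that case.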
FILE 1 OF PROGRAM TAPE  --  PROGRAM COMPILATION JOB


3BOBL,  TVOt PI. 
ACCOUNT. 

rO^YSPf INPUT* njM, SO ,CO*B. 
REWIND,DUM.
FTN,OPT=?,I=DUM.
REWIND,LGO.

CATALOG,LGO,RCBF2,ID=RCBF2,RP=14.
ITEMIZE,LGO.
AUDIT.
*EOR

      PROGRAM EUCDP(INPUT = 203B, TAPE1, TAPE2, OUTPUT = 203B,
     1 RCPAR = 203B, TAPE3 = RCPAR, TAPE5 = INPUT, TAPE6 = OUTPUT)
      COMMON X(1000)
      DIMENSION D(10), WT(32), RC1(7), SUS(2), IRD1(2), IRD2(2)
      INTEGER FMT(8), FMT1(2), FMT2(2), IDCS(2, 3)
      DATA IRD1, IRD2 / 2*1H , 10H(READ UNTI, 6HL EOF) /
      DATA FMT / 8H(0F10.6), 7 * 1H  /, WT / 32 * 1.0 /
      DATA FMT1 / 10H(8F10.6/0F, 5H10.6) /,
     1 FMT2 / 10H(0(8F10.6/, 9H),0F10.6) /
      DATA SUS / 9H  SQUARED, 9HUNSQUARED /, MEMP / 4LMEMP /
      DATA IDCS / 10H(ALL MERGE, 2HS), 10H(AUTO-CUTO, 3HFF),
     1 9H(HIGHEST , 5HMERGE /


C
C  COMPUTE L.L.H. TRIANGULAR MATRIX OF EUCLIDEAN DISTANCES...
C  READ FROM FILE 5 (INPUT)
C     PARAMETERS, FORMAT (6I4,F4.2,3I4,7A4)...
C        IUR .NE. 0 TO READ IN WEIGHT VECTOR, ELSE USE UNIT WTS.
C        NH = NUMBER OF ENTITIES (IF < 1, READ UNTIL EOF)
C        NV = NUMBER OF VARIABLES/FACTORS
C        IFMT .NE. 0 TO READ IN INPUT FORMAT FOR ROW OF SCORES, ELSE
C             USE DEFAULT FORMAT (8F10.6), FOR A ROW
C        ISQRT .NE. 0 TO OUTPUT UNSQUARED EUCLIDEAN DISTANCES
C              .EQ. 0 OR BLANK TO OUTPUT SQUARED EUCLIDEAN DISTANCES
C        NMETH = NUMBER OF CLUSTERING METHODS (USED BY COMBVEC AND
C              COMPOM)... DEFAULT (0 OR BLANK) IS 6
C        SAM = FRACTION SAMPLE FOR COPHENETIC CORRELATIONS...
C              GE 1.00 USE ALL PAIRS
C              LE 0 (DEFAULT) SELECT SAMPLE TO USE ABOUT 600 PAIRS
C        ICS .EQ. 0 OR BLANK...USE ALL MERGES TO CONSTRUCT DENDROGRAMS
C            .GT. 0...USE JOINING LEVELS UP TO THAT OF ICS-TH MERGE
C            .LT. 0...USE AUTO-CUTOFF OPTION FOR DENDROGRAM
C        ISJL .NE. 0...WRITE CUBE OF COPHENETIC CORRELATIONS AND STAN-
C              DARDIZED ORIGINAL AND JOINING DISTANCES ON FILE 9
C             .EQ. 0 OR BLANK...DON'T WRITE FILE 9
C        LBLFLG .EQ. 0 OR BLANK...NO ENTITY LABELS ARE SUPPLIED
C               .NE. 0...FILE 9 IS SUPPLIED, WITH LABELS IN FORMAT
C               (//3X,5A5,A4)
C        MNAM[1..NMETH] = OPTIONAL ARRAY OF METHOD NAMES, 1-4 CHAR-
C               ACTERS EACH, LEFT-JUSTIFIED WITH NO EMBEDDED BLANKS
C     CARD(S) WITH THE NV WEIGHTS, FORMAT (8F10.6), IF IUR .NE. 0
C     CARD WITH INPUT FORMAT, IF IFMT .NE. 0
C        NOTE:  THE FORMAT MUST SPECIFY EXPLICITLY ALL THE SCORES FOR
C        ONE ENTITY, WITH NO IMPLICIT REPETITION IN THE FORMAT, E.G.
C        13 STANDARDIZED SCORES FROM AN SPSS RUN WOULD REQUIRE THE
C        FORMAT (16X,8F8.5/16X,5F8.5).  THE DEFAULT FORMAT OF (8F10.6),
C        IF SPECIFIED, IS MODIFIED TO REFLECT THE NUMBER OF SCORES.
C        E.G. FOR 13 SCORES WITH DEFAULT FORMAT, THE FORMAT GENERATED
C        WOULD BE (8F10.6/5F10.6)
C     CARD WITH THE PROBLEM DESCRIPTION IN COL. 1 - 80
C  READ FROM FILE 1...SCORES, ROWWISE, IN SPECIFIED FORMAT
C  WRITE TO FILE 2 THE DISTANCE MATRIX, IN CONTINUOUS (10F10.6)
C     FORMAT
C  WRITE TO FILE RCPAR THE CARD USED BY EUCDP AND 4 CARDS FOR
C     THE CLUSTER PACKAGE.
C  THIS PROGRAM WAS WRITTEN BY ROBERT B. LEDINGHAM, UNIV. OF WASH.
C
C  READ PARAMETERS AND PUT TOGETHER FORMAT
C

      READ (5, 100) IUR, NH, NV, IFMT, ISQRT, NMETH, SAM, ICS, ISJL,
     1 LBLFLG, RC1
      ISQRT = MIN0(MAX0(0, ISQRT), 1)
      IF ( IABS(ICS) .EQ. 0 ) ICS = 0
      J = 1
      IF ( ICS .LT. 0 ) J = 2
      IF ( ICS .GT. 0 ) J = 3
      IF ( NH .GE. 1 ) GO TO 5
      IRD1(1) = IRD2(1)
      IRD1(2) = IRD2(2)
    5 WRITE(6, 120) NH, IRD1, NV, SUS(ISQRT + 1), NMETH, RC1, SAM, ICS,
     1 IDCS(1, J), IDCS(2, J), ISJL, LBLFLG
      IF ( IUR .NE. 0 ) READ (5, 101) (WT(I), I = 1, NV)
      IF ( IFMT .EQ. 0 ) GO TO 10
      READ (5, 102) FMT
      GO TO 15
   10 IF ( NV .GE. 9 ) GO TO 11
      FMT(1) = FMT(1) + ISHIFT(NV, 48)
      GO TO 15
   11 NV8 = (NV - 1) / 8
      NVL = NV - NV8 * 8
      IF ( NV8 .GT. 1 ) GO TO 12
      FMT(1) = FMT1(1) + ISHIFT(NVL, 6)
      FMT(2) = FMT1(2)
      GO TO 15
   12 FMT(1) = FMT2(1) + ISHIFT(NV8, 48)
      FMT(2) = FMT2(2) + ISHIFT(NVL, 42)
   15 WRITE(6, 121) FMT

C
C  READ SCORES...IF NH IS SPECIFIED, GET SUFFICIENT MEMORY AND READ.
C  IF NH NOT SPECIFIED, READ A CASE AT A TIME UNTIL EOF.  GET MEMORY
C  AS NEEDED IN 2K BLOCKS, STARTING WITH 1000 WORDS.
C
      IF ( NH .LT. 1 ) GO TO 22
      IW = NV * NH
      IF ( IW .LT. 1001 ) GO TO 20
      MEMORY = ISHIFT(LOCF(X(1)) + IW, 30)
      CALL RAP1(LOCF(MEMORY) + MEMP)
   20 READ (1, FMT) (X(I), I = 1, IW)
      GO TO 29
   22 NH = IWMAX = 0
      IW = 1000
   25 IWMIN = IWMAX + 1
      IWMAX = IWMAX + NV
      NH = NH + 1
      IF ( IWMAX .LT. IW ) GO TO 27
      IW = IW + 2048
      MEMORY = ISHIFT(LOCF(X(1)) + IW, 30)
      CALL RAP1(LOCF(MEMORY) + MEMP)
   27 READ (1, FMT) (X(I), I = IWMIN, IWMAX)
      IF ( EOF(1) ) 28, 25
   28 NH = NH - 1
      IW = IWMIN - 1
      WRITE(6, 126) NH
   29 IMAX = NV

C
C  COMPUTE THE LOWER TRIANGULAR MATRIX...IF NON-UNIT WEIGHTS, USE
C  THEM (CODE THROUGH STMT 40).  IF UNIT WEIGHTS, USE CODE OF STMTS
C  50 THROUGH 70.  IF ISQRT .NE. 0, MATRIX WILL BE UNSQUARED DISTANCES.
C
C  FOR INNER LOOPS COMPUTING ONE ELEMENT OF DIST MAT....IF XP IS
C  ARRAY DIMENSIONED (NV, NH) AND EQUIVALENT TO X ARRAY,
C  X(I), I = IMIN...IMAX CORRESPONDS TO XP(I, N+1), I = 1...NV
C  X(MM), I = IMIN...IMAX CORRESPONDS TO XP(I, M), I = 1...NV
C
      NHM = NH - 1
      K = 0
      IF ( IUR .EQ. 0 ) GO TO 50
      WRITE(6, 122) (WT(I), I = 1, NV)
      IF ( ISQRT .EQ. 0 ) GO TO 32
      DO 30 I = 1, NV
   30 IF ( WT(I) .LT. 0.0 ) GO TO 31
      GO TO 32
   31 WRITE(6, 128)
   32 CONTINUE
      DO 40 N = 1, NHM
      IMIN = IMAX + 1
      IMAX = IMAX + NV
      MM = 0
      DO 40 M = 1, N
      DK = 0.0
      J = 0
      DO 35 I = IMIN, IMAX
      MM = MM + 1
      J = J + 1
   35 DK = WT(J) * (X(I) - X(MM)) ** 2 + DK
      K = K + 1
      IF ( ISQRT .NE. 0 ) DK = SQRT(DK)
      D(K) = DK
      IF ( K .LT. 10 ) GO TO 40
      WRITE(2, 110) D
      K = 0
   40 CONTINUE
      GO TO 80
   50 WRITE(6, 123)
      DO 70 N = 1, NHM
      IMIN = IMAX + 1
      IMAX = IMAX + NV
      MM = 0
      DO 70 M = 1, N
      DK = 0.0
      DO 60 I = IMIN, IMAX
      MM = MM + 1
   60 DK = (X(I) - X(MM)) ** 2 + DK
      K = K + 1
      IF ( ISQRT .NE. 0 ) DK = SQRT(DK)
      D(K) = DK
      IF ( K .LT. 10 ) GO TO 70
      WRITE(2, 110) D
      K = 0
   70 CONTINUE
   80 IF ( K .GT. 0 ) WRITE(2, 110) (D(I), I = 1, K)
C
C  WRITE RCPAR FILE, WITH FIRST CARD READ BY EUCDP, PROBLEM DESCRIP-
C  TION CARD, TWO CREATED PARAMETER CARDS, AND LABEL/NO LABEL CARD.
C
      WRITE(3, 100) IUR, NH, NV, IFMT, ISQRT, NMETH, SAM, ICS,
     1 ISJL, LBLFLG, RC1
      WRITE(6, 124) IUR, NH, NV, IFMT, ISQRT, NMETH, SAM, ICS,
     1 ISJL, LBLFLG, RC1
      READ (5, 102) FMT
      WRITE(6, 125) FMT
      WRITE(3, 102) FMT
      WRITE(6, 127) NH
      WRITE(3, 103) NH
      IF ( IABS(LBLFLG) .EQ. 0 ) GO TO 90
      WRITE(3, 131)
      WRITE(6, 129)
      STOP
   90 WRITE(3, 132)
      WRITE(6, 130)
      STOP


  100 FORMAT(6I4,F4.2,3I4,7A4)
  101 FORMAT(8F10.6)
  102 FORMAT(8A10)
  103 FORMAT(I3,11H 1 0 2  0 2/9H(10F10.6))
  110 FORMAT(10F10.6)
  120 FORMAT(21H-NUMBER OF ENTITIES =,I4,2X,2A10/
     1 22H0NUMBER OF VARIABLES =,I4/
     2 17H0OUTPUT MATRIX IS,A11,19HEUCLIDEAN DISTANCES/
     3 20H0NUMBER OF METHODS =,I2,3X,7A6/
     4 21H0SAMPLING PARAMETER =,F5.2/
     5 25H0DENDOGRAM CUTOFF LEVEL =,I3,2X,2A10/
     6 23H0S.J.L. OUTPUT OPTION =,I4,24H (NONZERO TO WRITE FILE)/
     7 22H0ENTITY LABEL OPTION =,I4,
     8 37H (NONZERO TO READ LABELS FROM FILE 9))
  121 FORMAT(16H0INPUT FORMAT = ,8A10)
  122 FORMAT(24H0WEIGHTS ARE EXPLICIT.../(1X,8F10.6))
  123 FORMAT(17H0WEIGHTS ARE UNIT)
  124 FORMAT(29H0RCPAR FILE  (PARAMETERS) .../1H0,6I4,F4.2,3I4,7A4)
  125 FORMAT(1X,8A10)
  126 FORMAT(1H0,I4,14H ENTITIES READ)
  127 FORMAT(1X,I3,11H 1 0 2  0 2/10H (10F10.6))
  128 FORMAT(69H-****  WARNING...YOU HAVE UNSQUARED DISTANCE WITH NEGATI
     1VE WEIGHTS.../38H ****  SOME DISTANCES MAY BE IMAGINARY//)
  129 FORMAT(16H LABELS ON TAPE9)
  130 FORMAT(5H NOLB)
  131 FORMAT(15HLABELS ON TAPE9)
  132 FORMAT(4HNOLB)
      END
*EOR

      PROGRAM RUNCLUS(INPUT = 203B, OUTPUT = 203B, TAPE2, TAPE7, TAPE9,
     1 TAPE5 = INPUT, TAPE6 = OUTPUT)
C
C  PROGRAM RUNCLUS PERFORMS THE FUNCTIONS OF PROGRAM DRIVER MENTIONED
C  IN THE COMMENT SECTION OF SUBROUTINE CNTRL, I.E. IT ASSIGNS I/O
C  UNITS, GETS THE SPACE REQUIRED FOR THE CLUSTERING ALGORITHM, AND
C  CALLS SUBROUTINE CNTRL TO INITIATE THE CLUSTERING PROCESS.
C  THIS VERSION ALLOCATES STORAGE AUTOMATICALLY USING BLANK COMMON.
C  PARAMETER INPUT IS AS FOLLOWS...
C     CARD 1...SAME AS FOR EUCDP CARD 1
C     CARD 2...PROBLEM DESCRIPTION
C     CARD 3...COL 1-3 HAVE NUMBER OF ENTITIES BEING CLUSTERED
C              COL 4-14 HAVE CHARACTERS * 1 0 2  0 2*
C     CARD 4...FORMAT FOR DISTANCE MATRIX, CURRENTLY (10F10.6)
C     CARD 5...IF COL 1-4 = 4HNOLB, NO LABELS ARE READ
C                         = ANYTHING ELSE, LABELS ARE READ FROM FILE 9
C  IF PROGRAM EUCDP IS USED BEFORE RUNCLUS, EUCDP WILL CONSTRUCT THE
C  PARAMETER FILE FOR RUNCLUS AND SUBSEQUENT PROGRAMS IN THE JOB
C
      COMMON / ICSCOM / ICS
      COMMON X(1)
      DATA MEMP / 4LMEMP /
      READ (5, 100) N, ICS
      IF ( N .GE. 5« ) GO TO 20
      LIMIT = 41 * N
      GO TO 30
   20 LIMIT = N * (N + 25) / 2
   30 MEM = ISHIFT(LOCF(X(1)) + LIMIT, 30)
      WRITE(6, 110) N, LIMIT, MEM
      CALL RAP1(LOCF(MEM) + MEMP)
      CALL CNTRL(X, LIMIT)
      STOP
  100 FORMAT(4X,I4,20X,I4)
  110 FORMAT(21H-NUMBER OF ENTITIES =,I4/27H DYNAMIC STORAGE REQUIRED =,
     1I6,7HD WORDS/31H TOTAL FIELD LENGTH REQUIRED = ,O16,T38,10HB WORDS
     2   )
      END
*EOR

      SUBROUTINE CNTRL(X,LIMIT)
C
C     THIS SUBROUTINE ALLOCATES STORAGE, READS INPUT AND CONTROLS
C     EXECUTION FOR A HIERARCHICAL CLUSTERING JOB BASED ON A PROVIDED
C     SIMILARITY MATRIX.
C
C-----
C     INPUT SPECIFICATIONS
C
C     CARD 1  TITLE CARD
C     CARD 2  INFORMATION FOR SUBROUTINES CLSTR AND TREE
C       COLS  1- 3  NE=NUMBER OF ENTITIES (SUBJECTS OR ATTRIBUTES) TO BE
C                   CLUSTERED
C       COLS  4- 5  ISIGN=OPTION FOR SIMILARITY FUNCTION
C                   ISIGN=+1, DISTANCE MEASURE
C                   ISIGN=-1, CORRELATION MEASURE
C       COLS  6- 7  NTSV=TAPE UNIT ON WHICH CLSTR RESULTS ARE SAVED
C                   NTSV=7, PUNCH RESULTS ON CARDS
C                   NTSV.LE.0, DO NOT SAVE RESULTS
C       COLS  8- 9  NTIN=UNIT FROM WHICH SIMILARITY MATRIX IS READ
C                   NTIN=5, CARD READER
C                   NTIN.NE.5, DISK OR TAPE
C       COLS 10-12  INOPT=INPUT OPTION FOR SIMILARITY MATRIX
C                   INOPT.LE.0, THE LOWER TRIANGULAR MATRIX IS STORED AS
C                               ROWS IN ONE LONG LINEAR ARRAY AND READ IN
C                               IN ONE RECORD OF NE*(NE-1)/2 ELEMENTS
C                   INOPT.GT.0, THE LOWER TRIANGULAR MATRIX IS CONSIDERED
C                               TO BE STORED BY ROWS IN ONE LONG LINEAR
C                               ARRAY AND IS READ IN BLOCKS *INOPT* LONG.
C       COLS 13-14  KOUT=OUTPUT OPTION
C                   KOUT=+2, STANDARD OUTPUT
C                   KOUT=-2, STANDARD OUTPUT PLUS PUNCHED SEQUENCE LIST
C                            FROM SUBROUTINE *TREE*
C
C***ANY PREPOSITIONING OF THE I/O UNITS NTSV AND NTIN MUST BE
C   ACCOMPLISHED IN PROGRAM DRIVER OR THROUGH USE OF CONTROL CARDS.
C
C     CARD 3     INPUT FORMAT FOR SIMILARITY MATRIX (20A4 FORMAT)
C     CARD(S) 4  SIMILARITY MATRIX
C     CARD 5     END OF RECORD CARD (7/8/9)
C
C***INCLUDE CARDS 4 AND 5 ONLY IF THE SIMILARITY MATRIX IS ON CARDS***
C
C     CARD(S) 6  LABEL CARDS FOR ENTITIES.  THERE ARE TWO OPTIONS
C       1.  INCLUDE 1 CARD WITH THE 4 CHARACTERS *NOLB* IN COLUMNS 1-4.
C           UNDER THIS OPTION LABELS ARE NOT PRINTED ON THE TREE OUTPUT.
C
C       2.  INCLUDE NE CARDS, COLUMNS 1 TO 20 CONTAINING A LABEL FOR ONE
C           ENTITY.  ORDER THE LABEL CARDS IN THE SAME SEQUENCE AS THE
C           ENTITIES ARE REPRESENTED IN THE SIMILARITY MATRIX.
C-----
C
C     DECK SETUP SPECIFICATIONS
C
C     THE USER PROVIDES PROGRAM DRIVER WHICH PERFORMS THE FOLLOWING TASKS.
C       1.  ASSIGNS INPUT/OUTPUT UNITS
C       2.  ESTABLISHES THE DIMENSION OF THE X ARRAY AND SETS THIS
C           DIMENSION EQUAL TO LIMIT.
C       3.  CALLS SUBROUTINE CNTRL.
C     THE FOLLOWING EXAMPLE WILL SUFFICE IN MOST CASES.
C
C       PROGRAM DRIVER (INPUT,OUTPUT,PUNCH,TAPE5=INPUT,TAPE6=OUTPUT,
C      ATAPE7=PUNCH,TAPE1,TAPE2)
C       DIMENSION X(7000)
C       LIMIT=7000
C       CALL CNTRL(X,LIMIT)
C       END
C
C     A SECOND JOB DEPENDENT SEGMENT IS SUBROUTINE METHOD.  THE USER
C     SELECTS AMONG THE SEVERAL ALTERNATIVE VERSIONS OF THIS SUBROUTINE TO
C     IMPLEMENT THE DESIRED CLUSTERING TECHNIQUE.
C
C     THE SUBPROGRAMS CNTRL, CLSTR, MTXIN, LFIND AND TREE GO IN EVERY JOB.
C
C     THE X ARRAY IS PARTITIONED FOR STORAGE AS FOLLOWS
C     STORAGE FOR ARRAYS NEEDED AT ALL STAGES OF THE JOB
C     X(N1) TO X(N2-1)  NE WORDS--STORAGE OF THE II ARRAY
C     X(N2) TO X(N3-1)  NE WORDS--STORAGE OF THE JJ ARRAY
C     X(N3) TO X(N4-1)  NE WORDS--STORAGE OF THE SS ARRAY
C     X(N4) TO X(N5-1)  NE WORDS--STORAGE OF THE IL ARRAY
C     X(N5) TO X(N6-1)  NE WORDS--STORAGE OF THE JL ARRAY
C     X(N6) TO X(N7-1)  NE WORDS--STORAGE OF THE NEXT ARRAY
C     STORAGE FOR ARRAYS NEEDED IN SUBROUTINE CLSTR
C     M1=N7
C     X(M1) TO X(M2-1)  (NE*(NE-1))/2 WORDS--STORAGE OF THE S ARRAY
C     X(M2) TO X(M3-1)  NE WORDS--STORAGE OF THE LAST ARRAY
C     X(M3) TO X(M4-1)  NE WORDS--STORAGE OF THE NEAR ARRAY
C     X(M4) TO X(M5-1)  NE WORDS--STORAGE OF THE SREF ARRAY
C     X(M5) TO X(M6-1)  NE WORDS--STORAGE OF THE LIST ARRAY
C     X(M6) TO X(M7-1)  NE WORDS--STORAGE OF THE A ARRAY
C     X(M7) TO X(MP)    NE WORDS--STORAGE OF THE B ARRAY
C     STORAGE FOR ARRAYS NEEDED IN SUBROUTINE TREE (OVERLAY ARRAYS NEEDED
C     IN SUBROUTINE CLSTR)
C     L1=N7
C     X(L1) TO X(L2-1)  25*NE WORDS--STORAGE OF THE A ARRAY
C     X(L2) TO X(L3-1)  5*NE WORDS--STORAGE OF THE LABEL ARRAY
C     X(L3) TO X(L4-1)  NE WORDS--STORAGE OF THE LCLNO ARRAY
C     X(L4) TO X(L5-1)  NE WORDS--STORAGE OF THE LINE ARRAY
C     X(L5) TO X(L6-1)  NE WORDS--STORAGE OF THE IS ARRAY
C     X(L6) TO X(L7)    NE WORDS--STORAGE OF THE LAST ARRAY
C

      INTEGER FIRST
      DIMENSION X(1),FMT(20),TITLE(20),EPS(25)
      CALL SECOND(TIME)
      WRITE(6,2000) TIME
      READ(5,1000) TITLE
      READ(5,1100) NE,ISIGN,NTSV,NTIN,INOPT,KOUT
      WRITE(6,2500) TITLE
      WRITE(6,2200) NE,ISIGN,NTSV,NTIN,INOPT,KOUT
C     PARTITION THE STORAGE ARRAY
      N1=1
      N2=N1+NE
      N3=N2+NE
      N4=N3+NE
      N5=N4+NE
      N6=N5+NE
      N7=N6+NE
      M2=N7+(NE*(NE-1))/2
      M3=M2+NE
      M4=M3+NE
      M5=M4+NE
      M6=M5+NE
      M7=M6+NE
      MP=M7+NE-1
      L2=N7+25*NE
      L3=L2+6*NE
      L4=L3+NE
      L5=L4+NE
      L6=L5+NE
      L7=L6+NE-1
C     CHECK FOR SUFFICIENT STORAGE
      MAX=MP
      IF(L7.GT.MAX) MAX=L7
      WRITE(6,2300) MAX,LIMIT
      IF(MAX.GT.LIMIT) STOP
C     READ THE SIMILARITY MATRIX
      READ(5,1000) FMT
      WRITE(6,2100) FMT
      CALL MTXIN(X(N7),INOPT,NE,NTIN,FMT)
C     READY TO CLUSTER
   60 CALL CLSTR(X(N1),X(N2),X(N3),X(N4),X(N5),X(N6),X(N7),X(M2),X(M3),
     AX(M4),X(M5),X(M6),X(M7),TITLE,NE,ISIGN,NTSV)
C     READ LABEL CARD(S)
      FIRST=L2
      LAST=L2+5
      READ(5,1000) (X(I),I=FIRST,LAST)
      IF(X(FIRST).EQ.4HNOLB) GO TO 80
C     READ REMAINING LABELS
      LAST=L2-1
      DO 70 K=1,NE
      FIRST=LAST+1
      LAST=LAST+6
   70 READ(9,1005) (X(I),I=FIRST,LAST)
C     DRAW THE TREE CORRESPONDING TO THE CLUSTERING
   80 MERGES=NE-1
      CALL TREE(X(N1),X(N2),X(N3),X(N4),X(N5),X(N6),X(N7),X(L2),X(L3),
     AX(L4),X(L5),X(L6),EPS,TITLE,MERGES,1,6,1,KOUT,NE)
      CALL SECOND(TIME)
      WRITE(6,2000) TIME
      RETURN
 1000 FORMAT(20A4)
 1005 FORMAT(//3X,5A5,A4)
 1100 FORMAT(I3,3I2,I3,I2)
 2000 FORMAT(12H1TIME IS NOW,F10.3,8H SECONDS)
 2100 FORMAT(7H0FORMAT,20A4)
 2200 FORMAT(5H0NE =,I8,/,8H ISIGN =,I5,/,7H NTSV =,I6,/,7H NTIN =,I6,
     A/,8H INOPT =,I5,/,7H KOUT =,I6)
 2300 FORMAT(19H0REQUIRED STORAGE =,I5,6H WORDS,/,
     A 19H0ALLOTTED STORAGE =,I5,6H WORDS,/)
 2500 FORMAT(1H0,20A4)
      END
*EOR
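
The storage check in CNTRL can be worked through numerically. The sketch below is a minimal Python illustration of the partition arithmetic described in the CNTRL comments; the function name is hypothetical, and the figure of six words per entity label is an assumption read from the label-reading loop in the listing (the comment block itself says 5*NE words).

```python
def cntrl_storage(ne, label_words=6):
    """Return (MP, L7, required) for ne entities.
    MP is the last word used while clustering, L7 the last word used
    while drawing the tree; the larger of the two must not exceed LIMIT."""
    n7 = 1 + 6 * ne                              # II, JJ, SS, IL, JL, NEXT
    # S (strict lower triangle) plus LAST, NEAR, SREF, LIST, A, B
    mp = n7 + ne * (ne - 1) // 2 + 6 * ne - 1
    # 25-segment tree array, labels, then LCLNO, LINE, IS, LAST
    l7 = n7 + 25 * ne + label_words * ne + 4 * ne - 1
    return mp, l7, max(mp, l7)
```

Under these assumptions the tree arrays dominate for small problems (41 words per entity) and the similarity matrix dominates for large ones, with the two requirements equal at 59 entities.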

      FUNCTION LFIND(I,J)
C     IF THE LOWER TRIANGULAR PORTION OF A SYMMETRIC MATRIX IS STORED BY
C     ROWS IN A ONE-DIMENSIONAL ARRAY, THEN THE ELEMENT (I,J) IN THE FULL
C     MATRIX IS ELEMENT LFIND(I,J) IN THE LINEAR ARRAY
      IF(I.GT.J) GO TO 10
C     ROW J, COLUMN I
      LFIND=((J-1)*(J-2))/2+I
      RETURN
C     ROW I, COLUMN J
   10 LFIND=((I-1)*(I-2))/2+J
      RETURN
      END
*EOR
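
The index mapping used by LFIND can be checked with a few lines of Python. This is a sketch for illustration only: the lowercase function name is hypothetical, and rejecting the diagonal case is an added assumption (the FORTRAN function does not test for I equal to J, and the diagonal is not stored).

```python
def lfind(i, j):
    """1-based position of element (i, j) of a symmetric matrix whose
    strict lower triangle is stored by rows: (2,1), (3,1), (3,2), ..."""
    if i == j:
        raise ValueError("diagonal elements are not stored")
    if i < j:
        i, j = j, i          # use the mirrored element in the lower triangle
    return (i - 1) * (i - 2) // 2 + j
```

Enumerating the pairs row by row yields consecutive positions, which is the order in which the distance matrix is written in step 1.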

      SUBROUTINE CLSTR(II,JJ,SS,IL,JL,NEXT,S,LAST,NEAR,SREF,LIST,A,B,
     ATITLE,N,ISIGN,NT)
C     IN THIS VERSION THE LOWER TRIANGULAR PORTION OF THE SIMILARITY MATRIX
C     IS STORED BY ROWS IN THE ONE-DIMENSIONAL ARRAY S.
C
C     THE FOLLOWING VARIABLES ARE SPECIFIED IN THE CALLING PROGRAM AND
C     ARE PASSED THROUGH THE ARGUMENT LIST
C       N=NUMBER OF OBJECTS TO BE CLUSTERED
C       S(J)=J-TH ELEMENT IN LOWER TRIANGULAR SIMILARITY MATRIX
C       ISIGN=OPTION SPECIFYING TYPE OF SIMILARITY FUNCTION USED
C         ISIGN=+1=DISTANCE MEASURE (DECREASING FUNCTION OF SIMILARITY)
C         ISIGN=-1=CORRELATION MEASURE (INCREASING FUNCTION OF SIMILARITY)
C       NT=TAPE UNIT ON WHICH THE RESULTS ARE SAVED
C         NT.LE.0=DO NOT SAVE RESULTS ON TAPE
C         NT=7=SAVE RESULTS ON PUNCHED CARDS
C       TITLE=IDENTIFYING TITLE FOR THIS RUN
C     THE FOLLOWING VARIABLES REPRESENT THE OUTPUT OF THE PROGRAM AND ARE
C     PASSED BACK THROUGH THE ARGUMENT LIST.  THESE RESULTS ARE READY FOR
C     SUBROUTINE TREE.
C       K=STAGE OF CLUSTERING
C       II(K)=LOWER NUMBERED CLUSTER MERGED AT STAGE K
C       JJ(K)=UPPER NUMBERED CLUSTER MERGED AT STAGE K
C       SS(K)=VALUE OF SIMILARITY FUNCTION ASSOCIATED WITH MERGE AT STAGE K
C       IL(K)=PRECEDING STAGE AT WHICH II(K) WAS LAST IN A MERGE
C       JL(K)=PRECEDING STAGE AT WHICH JJ(K) WAS LAST IN A MERGE
C       NEXT(K)=NEXT STAGE AT WHICH II(K) IS IN A MERGE
C
C     IN ADDITION, THE FOLLOWING VARIABLES PLAY IMPORTANT ROLES IN THE
C     PROGRAM
C       NEAR(I)=ID NUMBER OF EXTREME ELEMENT IN ROW I OF THE LOWER
C               TRIANGULAR SIMILARITY MATRIX.
C       SREF(I)=SIMILARITY MEASURE FOR THE PAIR (I,NEAR(I))
C       LIST(I)=I-TH CLUSTER ID NUMBER IN SEQUENTIAL LIST OF CURRENT
C               CLUSTERS
C       NCL=NUMBER OF CLUSTERS AT CURRENT STAGE
C       LAST(I)=STAGE NUMBER AT WHICH CLUSTER I WAS LAST IN A MERGE
C       A=WORKING AREA FOR SUBROUTINE METHOD
C       B=WORKING AREA FOR SUBROUTINE METHOD
C
C     THIS SUBROUTINE USES FUNCTION LFIND(I,J) TO FIND THE ADDRESS IN S
C     FOR THE SIMILARITY MEASURE BETWEEN CLUSTERS I AND J
      DIMENSION S(1),II(1),JJ(1),SS(1),IL(1),JL(1),NEXT(1),NEAR(1),
     ASREF(1),LIST(1),LAST(1),A(1),B(1)
      DIMENSION TITLE(20)
C     INITIALIZE VARIABLES AND SET CONSTANTS
      NCL=N
      K=1
      SIGN=ISIGN
      BIG=SIGN*1.E50
      CALL METHOD(S,NEAR,SREF,LIST,A,B,SREFX,SIGN,N,NCL,LREF,NREF,1)
C     INITIALIZE ARRAYS
      DO 10 J=1,N
      LAST(J)=0
      NEXT(J)=0
      LIST(J)=J
      SREF(J)=BIG
   10 CONTINUE
C     FIND EXTREME ENTRY IN EACH ROW
      L=0
      DO 30 I=2,N
      I1=I-1
      DO 30 J=1,I1
      L=L+1
C     IN EFFECT S(L)=S(I,J)
      IF(((S(L)-SREF(I))*SIGN).GT.0.) GO TO 30
      NEAR(I)=J
      SREF(I)=S(L)
   30 CONTINUE
C     MAIN LOOP.  FIND EXTREME VALUE IN SREF ARRAY
   40 SREFX=BIG
      DO 50 I=2,NCL
      LISTI=LIST(I)
      IF(((SREF(LISTI)-SREFX)*SIGN).GT.0) GO TO 50
      IREF=I
      LREF=LISTI
      SREFX=SREF(LISTI)
   50 CONTINUE
C     LREF IS THE ROW NUMBER CONTAINING THE EXTREME ENTRY IN THE S ARRAY.
C     IF THERE ARE TIES, THEN LREF IS THE HIGHEST NUMBERED ROW WITH THIS
C     EXTREME VALUE.  HENCE LREF.GT.NEAR(LREF).  IREF IDENTIFIES THE
C     PLACEMENT OF LREF IN THE LIST ARRAY.
      NREF=NEAR(LREF)
      CALL METHOD(S,NEAR,SREF,LIST,A,B,SREFX,SIGN,N,NCL,LREF,NREF,2)
C     GENERATE MERGE DATA NEEDED FOR SUBROUTINE TREE
      II(K)=NREF
      JJ(K)=LREF
      SS(K)=SREFX
      IL(K)=LAST(NREF)
      JL(K)=LAST(LREF)
      LAST(NREF)=K
      IF(IL(K).EQ.0) GO TO 60
      ILK=IL(K)
      NEXT(ILK)=K
   60 IF(JL(K).EQ.0) GO TO 70
      JLK=JL(K)
      NEXT(JLK)=K
   70 K=K+1
C     TERMINATE IF N-1 MERGES HAVE OCCURRED
      IF(K.EQ.N) GO TO 140
C     UPDATE FOR THE NEXT CYCLE
      NCL=NCL-1
      IF(IREF.GT.NCL) GO TO 90
C     UPDATE LIST ARRAY BY REMOVING LREF AND PUSHING DOWN THE LIST
      DO 80 I=IREF,NCL
   80 LIST(I)=LIST(I+1)
C     UPDATE FOR NEXT CYCLE
   90 CALL METHOD(S,NEAR,SREF,LIST,A,B,SREFX,SIGN,N,NCL,LREF,NREF,3)
      GO TO 40
C     CLUSTERING FINISHED AND ALL ANCILLARY INFORMATION GENERATED.
C     SAVE RESULTS AS DESIRED.
  140 K=K-1
  160 IF(NT.LE.0) RETURN
      WRITE(NT,2300) TITLE
      DO 170 I=1,K
  170 WRITE(NT,2200) I,II(I),JJ(I),SS(I),IL(I),JL(I),NEXT(I)
      RETURN
 2200 FORMAT(3I10,E16.8,3I10)
 2300 FORMAT(20A4)
      END
*EOR

      SUBROUTINE MTXIN(X,IOPT,NE,NTIN,FMT)
C     THIS SUBROUTINE READS A LOWER TRIANGULAR MATRIX *X* REPRESENTING
C     ASSOCIATION AMONG *NE* ENTITIES.  THE MATRIX IS READ FROM UNIT *NTIN*
C     IN FORMAT *FMT*.  THE MODE OF INPUT FOR THE MATRIX IS DETERMINED BY
C     THE *IOPT* PARAMETER AS FOLLOWS.
C       IOPT.LE.0, MATRIX IS READ IN LOWER TRIANGULAR FORM BY ROWS, EACH
C                  ROW BEING A NEW RECORD.
C       IOPT.GT.0, MATRIX IS READ IN CONSTANT LENGTH BLOCKS, EACH *IOPT*
C                  WORDS LONG.
      DIMENSION FMT(20),X(1)
      INTEGER FIRST
      IF(IOPT.LE.0) GO TO 30
C     READ THE SIMILARITY MATRIX IN BLOCKS *IOPT* LONG
      FIRST=1
      LAST=IOPT
   10 READ(NTIN,FMT) (X(I),I=FIRST,LAST)
C     USE THE END OF RECORD CARD TO SIGNIFY END OF THE SIMILARITY MATRIX
      IF( EOF(NTIN) ) 60,20
   20 FIRST=FIRST+IOPT
      LAST=LAST+IOPT
      GO TO 10
C     READ THE SIMILARITY MATRIX AS ROWS OF A LOWER TRIANGULAR MATRIX,
C     IN ONE RECORD OF NC WORDS
   30 NC=(NE**2-NE)/2
      READ(NTIN,FMT) (X(I),I=1,NC)
      IF ( EOF(NTIN) ) 200,40
C     PASS THE END OF FILE
   40 READ(NTIN,FMT) Z
      IF( EOF(NTIN) ) 60,210
   60 RETURN
C     ERROR MESSAGES
  200 WRITE(6,2400)
      STOP
  210 WRITE(6,2500)
      STOP
 2400 FORMAT(36H0EOF ENCOUNTERED WHEN NONE EXPECTED.)
 2500 FORMAT(30H0NO EOF WHEN ONE WAS EXPECTED.)
      END
*EOR

SUBROUTINE  TREF(I»J»?tIL»JL»NEXT#A,LA^FLtLCLNO»lINP,IS.LAST,FP^, 
*TITLF.tN»<BEGtNT» INTR  V» I °R  NT  • PAXIN ) 

C 

C      DATA    T  N°UT   THROUGH   CALLING  SEQUENCE 
C 

C     N«HIGHEST  STSGE   NUMBER    IN   THE   CLUSTER   M  ER  G  E  DATA    (MUST  Re  Fx&rT) 
C     <BEG"STAGC   NumbcR   AT  WHICH   THE    TREE   BEGINS.    DEFAULT  VAL''F  1 
C     N  T»  T  A 0 1   NUMBER    FOP   PRINTED  OUTPUT,    0  Ec  AULT  VALUE   ■  6 
C      INTRV*INTE"V AL    OeTION   FQO  SEGMENTATION 

C  INTRV"!  «D£P  AULT  VALUE,  CONSTRUCT  EPS  BY  DTVIOING  THE  R  ANG  c  OF  S  T  N  TH 
C  25   EOUAL  SEGMENTS 

C      INTRV«2*EPS    IS    PROVIDED   AS    PART   OF    THE    ARGUMENT  LIST 

C      INTRV*3»THF    I S    ARRAY    TS   ALREADY   CONSTRUCTED   AMD    EPS    IS    PRnvTOcD   c0»  T 

r      [P9NT*PPTNT  0DTION    ftb    IN<"JT  INFORMATION 

C      I ARS( IPRNTJ.l,    PRINT   ONLY    TITLE    AND   *IS*  ARRAY 

C      T A3S  (  IP&NT ) ,NE . 1.    TN   AQCTTION   PRINT   THE  CUSTER    "FRGE  DATA 

C      IPRNT.LE.O,    IN   ADDITION,    PijNCH    THE   SFQUENCE    I N   WHTCm   THE  FV'TTTTFS 


C  APPEAR    IN    THE    TR«=E    (NEEDED   FOR   3OST-ANALYS  IS   OF  DATA 

C  UNIT  CLUSTEPTNG   IN   SUBROUTINE  *ROSTDU*). 

C     EPS(M)=RIGHT ENDPOINT FOR THE MTH INTERVAL USED FOR SEGMENTING S
C     LABEL(M,IJ)=MTH OF 5 WORDS IDENTIFYING THE IJTH OBJECT
C     TITLE=ARRAY OF 20 WORDS FOR IDENTIFYING THE RUN.
C     K=INDEX IDENTIFYING STAGE NUMBER IN THE CLUSTERING
C     I(K)=LOWER NUMBERED CLUSTER IDENTIFICATION NUMBER IN THE MERGE AT THE
C          K TH STAGE
C     J(K)=UPPER NUMBERED CLUSTER IDENTIFICATION NUMBER IN THE MERGE AT THE
C          K TH STAGE
C     S(K)=VALUE OF THE CRITERION FUNCTION FOR THE MERGE AT THE KTH STAGE
C     IS(K)=CATEGORIZED VALUE OF S, INTEGER IN RANGE 1 TO 25
C     IL(K)=STAGE NUMBER WHEN I(K) WAS LAST IN A MERGE (0 FOR FIRST MERGE F
C     JL(K)=STAGE NUMBER WHEN J(K) WAS LAST IN A MERGE (0 FOR FIRST MERGE F
C     NEXT(K)=STAGE NUMBER WHEN I(K) IS NEXT IN A MERGE
C     MAXIN=HIGHEST CLUSTER ID NUMBER IN THE CLUSTER MERGE DATA
C

C     OTHER VARIABLES USED IN THE PROGRAM
C
C     LINE(I)=LINE NUMBER IN THE PRINTOUT AT WHICH I(K) IS CARRIED (AFTER
C             MOST RECENT MERGE)
C     LCLNO(L)=THE CLUSTER NUMBER TO BE PRINTED ON LINE L AT THE LEFT OF TH
C     A(M,L)=THE MTH SEGMENT (OF 25) IN THE L TH LINE OF THE PRINTOUT
C     LAST(L)=FARTHEST RIGHT SEGMENT IN LINE L WHICH IS NOT BLANK
C

C     IN ADDITION, COMMON BLOCK /ICSCOM/ PROVIDES PARAMETER ICS TO ALLOW
C     ONLY A PORTION (HORIZONTALLY) OF THE TREE TO BE SHOWN, THUS EXPAND-
C     ING THE DETAIL OF THE TREE, SINCE AN EXTREME ISOLATE CAN HAVE THE
C     EFFECT OF SQUASHING MERGES TOGETHER IN THE REST OF THE TREE.
C     DEFAULT ICS = 0 PRINTS THE ENTIRE TREE.  ICS > 0 PRINTS MERGES OF
C     LEVEL UP THROUGH THAT OF THE ICS-TH MERGE.  ICS < 0 SELECTS AUTO-
C     CUTOFF... LOOKING AT THE LAST 5 MERGES IN REVERSE ORDER, IF THE
C     MERGE EXISTS UNIQUELY FOR MORE THAN 1/5 THE ENTIRE RANGE OF DIST-
C     ANCES, THE MAXIMUM DISTANCE SCALED IN THE DENDROGRAM IS SET
C     TO THE DISTANCE OF THE NEXT HIGHEST MERGE.
C

      DIMENSION I(N),J(N),S(N),IS(N),IL(N),JL(N),NEXT(N),LCLNO(N),
     AA(25,N),LAST(N)
      DIMENSION LINE(MAXIN),LABEL(6,MAXIN)
      DIMENSION EPS(25),TITLE(20)
      COMMON / ICSCOM / ICS
      REAL LABEL

      DATA BARI,BLNKI,BARS,BLANK/4H---I,4H   I,4H----,4H    /
      DATA ICS / 0 /
C     DEFAULT VALUES
      KBEG = MAX0(KBEG,1)
      IF(INTRV.LT.1.OR.INTRV.GT.3) INTRV=1
      IF(NT.LE.0) NT=6
C     INITIALIZE ARRAYS
      NOBJ=N+1
      DO 10 K=1,NOBJ
      LINE(K)=0
      LCLNO(K)=0
      LAST(K)=0
      DO 10 L=1,25
      A(L,K)=BLANK
   10 CONTINUE
C     SEGMENT THE S ARRAY
      IF ( INTRV - 2 ) 20, 40, 120
C     CONSTRUCT INTERVALS OF EQUAL LENGTH
   20 SRMAX = S(N)
      IF ( ICS ) 24, 25, 22
   22 SRMAX = S(ICS)
      GO TO 25
C     ICS < 0...USE AUTO-CUTOFF
   24 DO 26 L = 1, 5
      IF (S(N-L+1)-S(N-L) .GT. (SRMAX-S(1))/5.0) SRMAX = S(N-L)
   26 CONTINUE
   25 RANGE = SRMAX - S(KBEG)
      DELTA=RANGE/25.
      EPS(1)=S(KBEG)+DELTA
      DO 30 K=2,24
   30 EPS(K)=EPS(K-1)+DELTA
      EPS(25)=S(N)
C     CONSTRUCT THE IS ARRAY
   40 IF(EPS(1).GT.EPS(2)) GO TO 70
C     S INCREASES WITH DISSIMILARITY (AS DOES A DISTANCE)
      KK=1
      DO 60 K=1,N
   50 IF(S(K).LE.EPS(KK)) GO TO 60
      IF(KK.EQ.25) GO TO 60
      KK=KK+1
      GO TO 50
   60 IS(K)=KK
      GO TO 120
C     S DECREASES WITH DISSIMILARITY (AS DOES A CORRELATION)
   70 KK=24
      KKK=25
      NN=N+1
      DO 90 K=1,N
      KCOMP=NN-K
   80 IF(S(KCOMP).LT.EPS(KK)) GO TO 90
      KKK=KK
      KK=KK-1
      IF(KK.EQ.0) GO TO 100
      GO TO 80
   90 IS(KCOMP)=KKK
  100 DO 110 K=1,KCOMP
  110 IS(K)=1

C     PRINT INPUT TO TREE
  120 WRITE(NT,2000) TITLE
      WRITE(NT,2100) KBEG,N
      WRITE(NT,2200)
      WRITE(NT,2300)
      M=1
      WRITE(NT,2400) M,S(KBEG),EPS(M)
      DO 130 M=2,25
  130 WRITE(NT,2400) M,EPS(M-1),EPS(M)
      IF(IABS(IPRNT).EQ.1) GO TO 150
C     PRINT THE CLUSTER MERGE DATA
      WRITE(NT,2000) TITLE
      WRITE(NT,2500)
      DO 140 K=KBEG,N
      WRITE(NT,2600) K,I(K),J(K),S(K),IS(K),IL(K),JL(K),NEXT(K)
      WRITE(7,4321) K,I(K),J(K),S(K),IL(K),JL(K),IS(K)
 4321 FORMAT (3I5,E16.4,3I5)
  140 CONTINUE

C     START TREE WITH THE MOST SIMILAR PAIR
  150 K=KBEG
      LNO=0
C     MERGE CLUSTERS I(K) AND J(K)
  160 IK=I(K)
      JK=J(K)
C     SET LINE NUMBERS FOR OUTPUT
      IF(IL(K).NE.0) GO TO 170
      LNO=LNO+1
      LINE(IK)=LNO
      LCLNO(LNO)=IK
  170 IF(JL(K).NE.0) GO TO 180
      LNO=LNO+1
      LINE(JK)=LNO
      LCLNO(LNO)=JK
C     FILL IN THE PRINT LINES
  180 ISK=IS(K)
      KT=0
      ITEM=IK
  190 LITEM=LINE(ITEM)
      IF(ISK-LAST(LITEM)-1) 225,200,210
C     ADD ONLY ONE MORE SEGMENT FOR LINE(ITEM)
  200 A(ISK,LITEM)=BARI
      LAST(LITEM)=ISK
      GO TO 225
C     ADD MORE THAN ONE SEGMENT
  210 LBEG=LAST(LITEM)+1
      LEND=ISK-1
      DO 220 L=LBEG,LEND
  220 A(L,LITEM)=BARS
      GO TO 200
C     REPEAT FOR CLUSTER J(K)
  225 KT=KT+1
      IF(KT.NE.1) GO TO 230
      ITEM=JK
      GO TO 190

C     TAKE CARE OF ANY LINES BETWEEN I(K) AND J(K)
  230 LIK=LINE(IK)
      LJK=LINE(JK)
      IF(LIK.GT.LJK) GO TO 240
      LBOT=LJK
      LTOP=LIK
      GO TO 250
  240 LBOT=LIK
      LTOP=LJK
  250 IF(LBOT.EQ.(LTOP+1)) GO TO 270
C     MUST FILL IN SOME VERTICAL CONNECTIONS
      LBEG=LTOP+1
      LEND=LBOT-1
      DO 260 L=LBEG,LEND
      IF(A(ISK,L).EQ.BARI) GO TO 260
      A(ISK,L)=BLNKI
      LAST(L)=ISK
  260 CONTINUE

C     UPDATE LINE NUMBER FOR NEW CLUSTER
  270 LINE(IK)=(LINE(IK)+LINE(JK))/2
C     MERGE COMPLETE.  FIND NEXT STAGE
      KLAST=K
      K=NEXT(K)
      IF(K.GT.N.OR.K.LT.KBEG) GO TO 400
      IF(IL(K).LE.0) GO TO 280
      IF(JL(K).LE.0) GO TO 290
      GO TO 300
  280 IL(K)=-IL(K)
      GO TO 160
  290 JL(K)=-JL(K)
      GO TO 160

C     THIS MERGE INVOLVES CLUSTERS THAT EACH HAVE MORE THAN ONE MEMBER.
C     BACKTRACK TO THE ROOT OF THE TREE ALONG THE UNEXPLORED BRANCH.
  300 IF(IL(K).EQ.KLAST) GO TO 310
C     GO DOWN IL(K) BRANCH.  SET JL(K) SO WE KNOW NOT TO GO DOWN THAT BRANC
      JL(K)=-JL(K)
      K=IL(K)
      GO TO 320
C     GO DOWN JL(K) BRANCH.  SET IL(K) SO WE KNOW NOT TO GO DOWN THAT BRANC
  310 IL(K)=-IL(K)
      K=JL(K)
  320 IF(K.LT.1.OR.K.GT.N) GO TO 600
C     TEST TO SEE IF THE END HAS BEEN REACHED.  IL(K)=JL(K) IFF BOTH ZERO.
      IF(IL(K)-JL(K)) 330,160,350
  330 IF(IL(K).EQ.0) GO TO 360
  340 K=IL(K)
      GO TO 320
  350 IF(JL(K).EQ.0) GO TO 340
  360 K=JL(K)
      GO TO 320
C     PRINT THE TREE
  400 WRITE(NT,2000) TITLE
      WRITE(NT,3000) (K,K=1,25)
      ENDFILE 7
      IF(LABEL(1,1).EQ.4HNCL?) GO TO 420
      DO 410 L=1,LNO
      LL=LCLNO(L)
      WRITE(7,417) LL
  417 FORMAT (I5)
  410 WRITE(NT,3100) (LABEL(K,LL),K=1,6),LL,(A(K,L),K=1,25)
      GO TO 440
C     LEAVE LABEL SPACES BLANK
  420 DO 430 L=1,LNO
      LL=LCLNO(L)
      WRITE(7,417) LL
  430 WRITE(NT,3200) LL,(A(K,L),K=1,25)
C     TREE COMPLETE
  440 WRITE(NT,3000) (K,K=1,25)
      ENDFILE NT
      IF(IPRNT.GT.0) RETURN
C     PUNCH SEQUENCE LIST
      WRITE(7,3900) TITLE
      WRITE(7,4000) (LCLNO(L),L=1,LNO)
      RETURN
C     ERROR.  PRINT AS MUCH OF THE TREE AS HAS BEEN CONSTRUCTED
  600 WRITE(NT,6000) KLAST,K
      GO TO 400
 2000 FORMAT(1H1,20X,20A4)
 2100 FORMAT(65H0THIS RUN DEPICTS THE PORTION OF THE TREE GENERATED BETW
     AEEN STAGE,I5,10H AND STAGE,I5,19H OF THE CLUSTERING.)
 2200 FORMAT(63H0THE CRITERION VALUES ARE SEGMENTED INTO THE FOLLOWING C
     ALASSES.)
 2300 FORMAT(6H0CLASS,5X,11HLOWER BOUND,5X,11HUPPER BOUND)
 2400 FORMAT(1X,I5,2E16.8)
 2500 FORMAT(1H0,9X,1HK,9X,1HI,9X,1HJ,15X,1HS,8X,2HIS,8X,2HIL,8X,2HJL,6X
     A,4HNEXT)
 2600 FORMAT(1X,3I10,E16.3,4I10)
 3000 FORMAT(10H0ITEM NAME,19X,5HID NO,1X,25I4)
C     IF LOCAL CONVENTIONS PERMIT, RECOMMEND THAT THE CARRIAGE CONTROL
C     CHARACTER IN FORMATS 3100 AND 3200 ALLOW 66 LINES OF PRINT PER PAGE,
C     THAT IS, THE MARGINS AT THE TOP AND BOTTOM OF THE PAGE ARE SUPPRESSED
C     AND PRINTING IS SINGLE SPACE.
 3100 FORMAT (1H ,5A5,A4,1X,I4,25A4)
 3200 FORMAT(30X,I5,25A4)
 3900 FORMAT(20A4)
 4000 FORMAT(20I4)
 6000 FORMAT(37H0ERROR. WHILE BACKTRACKING FROM KLAST,I6,26H K WAS FOUND
     A OUT OF RANGE.,/,1X,3HK =,I20)
      END
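
The segmentation step of the tree-printing subroutine above (the INTRV=1 case with S increasing) can be sketched in Python. This is a minimal illustration, not the report's code: the function name `segment` and the `n_classes` parameter are assumptions introduced here, the ICS cutoff is ignored, and the class boundaries are computed by multiplication rather than by the running sum used in the listing.

```python
def segment(s, kbeg=1, n_classes=25):
    """Bin nondecreasing merge criteria into classes 1..n_classes.

    s        -- criterion values S(1..N) in merge order (0-based list here)
    kbeg     -- first stage shown in the tree (1-based, as in the listing)
    Returns (eps, classes): right endpoints EPS(M) and the IS(K) values.
    """
    lo, hi = s[kbeg - 1], s[-1]
    delta = (hi - lo) / n_classes
    # EPS(M) is the right endpoint of the M-th interval; the last is S(N).
    eps = [lo + delta * (m + 1) for m in range(n_classes - 1)] + [hi]
    classes, kk = [], 1
    for value in s:
        # Advance the class index until the value fits, capped at the top.
        while value > eps[kk - 1] and kk < n_classes:
            kk += 1
        classes.append(kk)
    return eps, classes
```

For example, `segment([0.0, 1.0, 2.0, 10.0], n_classes=5)` places the first three merges in class 1 and the last in class 5, which is the squashing effect that the ICS cutoff parameter is meant to relieve.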

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,A,B,SREFX,SIGN,N,NCL,LREF,NREF,
     AJOB)
C
C     HIERARCHICAL CLUSTERING BY SINGLE LINKAGE.  ALGORITHM IS DERIVED
C     FROM
C     JOHNSON, S.C., HIERARCHICAL CLUSTERING SCHEMES, PSYCHOMETRIKA,
C     VOLUME 32, NUMBER 3, SEPTEMBER 1967, PP 241-254.
C

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),A(1),B(1)
      IF ( JOB - 2 ) 10, 15, 20
C     JOB=1, INITIALIZATION
   10 WRITE(6,2000)
 2000 FORMAT (26H0SINGLE LINKAGE CLUSTERING)
      BIG = SIGN*1.E50
      RETURN
C     JOB=2, DUMMY ENTRY.
   15 RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
   20 SREF(NREF)=BIG
      DO 50 J=1,NCL
C     UPDATE ENTRIES IN S ARRAY ASSOCIATED WITH NREF
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 50
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST SO I NEED NOT BE
C     TESTED FOR EQUALITY WITH LREF
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      IF(((S(LL)-S(LN))*SIGN).GE.0.) GO TO 30
      S(LN)=S(LL)
   30 IF(I.GT.NREF) GO TO 40
C     CHECK WHETHER S(LN) IS A BETTER CANDIDATE FOR SREF(NREF)
      IF(((S(LN)-SREF(NREF))*SIGN).GT.0.) GO TO 50
      NEAR(NREF)=I
      SREF(NREF)=S(LN)
      GO TO 50
C     UPDATE NEAR ARRAY FOR THOSE ROWS WHOSE EXTREME ELEMENT WAS LREF
   40 IF(NEAR(I).NE.LREF.AND.NEAR(I).NE.NREF) GO TO 50
      NEAR(I)=NREF
      SREF(I)=S(LN)
   50 CONTINUE
      RETURN
      END
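
The JOB=3 update in the single linkage routine above keeps, for every other cluster I, the better of its two links to the merged pair. A hedged Python sketch of that one comparison follows; the function name and argument names are assumptions made for illustration, and `sign` plays the role of SIGN in the listing (+1 when S is a distance, -1 when S is a similarity such as a correlation).

```python
def single_linkage_update(d_il, d_in, sign=1.0):
    """Return the new S(I,NREF) after LREF has been merged into NREF.

    d_il -- current value between cluster I and LREF
    d_in -- current value between cluster I and NREF
    The link to LREF replaces the link to NREF only when it is strictly
    better in the direction given by sign.
    """
    return d_il if (d_il - d_in) * sign < 0 else d_in
```

With distances this is the minimum of the two links; with similarities it is the maximum, so one routine serves both kinds of input.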

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,A,B,SREFX,SIGN,N,NCL,LREF,NREF,
     AJOB)
C
C     HIERARCHICAL CLUSTERING BY COMPLETE LINKAGE.  THE ALGORITHM IS
C     DERIVED FROM
C     JOHNSON, S.C., HIERARCHICAL CLUSTERING SCHEMES, PSYCHOMETRIKA,
C     VOLUME 32, NUMBER 3, SEPTEMBER 1967, PP 241-254.
C

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),A(1),B(1)
      IF ( JOB - 2 ) 10, 15, 20
C     JOB=1, INITIALIZATION
   10 WRITE(6,2000)
 2000 FORMAT(28H0COMPLETE LINKAGE CLUSTERING)
      BIG=SIGN*1.E50
      RETURN
C     JOB=2, DUMMY ENTRY.
   15 RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
   20 DO 30 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 30
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST SO I NEED NOT BE
C     TESTED FOR EQUALITY WITH LREF.
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      IF(((S(LL)-S(LN))*SIGN).LE.0) GO TO 30
      S(LN)=S(LL)
   30 CONTINUE
C     UPDATE THE NEAR AND SREF ARRAYS.  IF THE EXTREME ELEMENT IN ROW I
C     WAS EITHER LREF OR NREF, THEN IT IS NECESSARY TO FIND A NEW EXTREME
C     ELEMENT.  ROWS PRIOR TO NREF NEED NOT BE CONSIDERED.
   40 DO 50 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 55
   50 CONTINUE
   55 IF(J.EQ.1) GO TO 80
   60 SREF(I)=BIG
      J1=J-1
      DO 70 L=1,J1
      LISTL=LIST(L)
      LL=LFIND(I,LISTL)
      IF(((S(LL)-SREF(I))*SIGN).GE.0.) GO TO 70
      NEAR(I)=LISTL
      SREF(I)=S(LL)
   70 CONTINUE
   80 J=J+1
      IF(J.GT.NCL) RETURN
      I=LIST(J)
      IF(NEAR(I).EQ.LREF.OR.NEAR(I).EQ.NREF) GO TO 60
      GO TO 80
      END
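
The complete linkage routine above does two things at JOB=3: it keeps the worse of the two links to the merged pair, and it rescans any row whose recorded extreme element was LREF or NREF. The Python sketch below illustrates both steps under stated assumptions: the function names are hypothetical, and the packed S array addressed through LFIND is replaced by a dictionary keyed on unordered pairs.

```python
def complete_linkage_update(d_il, d_in, sign=1.0):
    """New S(I,NREF): the link to LREF wins only if it is strictly worse."""
    return d_il if (d_il - d_in) * sign > 0 else d_in


def rescan_row(i, earlier, s, sign=1.0):
    """Find the best partner of cluster i among the clusters in `earlier`.

    s maps frozenset({a, b}) to the current criterion value for that pair.
    Returns (near, sref), the analogues of NEAR(I) and SREF(I).
    """
    near, sref = None, sign * 1.0e50   # BIG in the listing
    for other in earlier:
        value = s[frozenset((i, other))]
        if (value - sref) * sign < 0:
            near, sref = other, value
    return near, sref
```

Starting the scan from `sign * 1.0e50` means any real entry beats the initial value in either direction, which is why the listing sets BIG once at JOB=1.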

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,NUMBR,SUM,SREFX,SIGN,N,NCL,
     ALREF,NREF,JOB)
C
C     HIERARCHICAL CLUSTERING BY MINIMIZING THE AVERAGE DISTANCE OR
C     MAXIMIZING THE AVERAGE CORRELATION BETWEEN THE MERGED GROUPS.
C
C     THE ALGORITHM IS DERIVED FROM THE *GROUP AVERAGE* METHOD DESCRIBED I
C         LANCE, G.N. AND W.T. WILLIAMS, A GENERAL THEORY OF CLASSIFICATORY
C         SORTING STRATEGIES, I. HIERARCHICAL SYSTEMS, THE COMPUTER JOURNAL,
C         VOLUME 9, NUMBER 4, FEBRUARY 1967, PP 373-380.
C

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),NUMBR(1),SUM(1)
      IF ( JOB - 2 ) 10, 25, 30
C     JOB=1, INITIALIZE.
C     NUMBR(I)=NUMBER OF ENTITIES CURRENTLY IN THE I-TH CLUSTER
   10 WRITE(6,2000)
 2000 FORMAT(42H0AVERAGE LINKAGE BETWEEN THE MERGED GROUPS)
      DO 20 J=1,N
   20 NUMBR(J)=1
      BIG=SIGN*1.E50
      RETURN
C     JOB=2, DUMMY ENTRY.
   25 RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
C     UPDATE THE NEW CLUSTER
   30 NUMBR(NREF)=NUMBR(NREF)+NUMBR(LREF)
C     UPDATE ENTRIES IN THE REDUCED SIMILARITY MATRIX.  THE ENTRIES ARE
C     THE SUM TOTAL OF SIMILARITY VALUES ASSOCIATED WITH ALL
C     PAIRWISE LINKS BETWEEN THE ELEMENTS OF THE TWO CLUSTERS.
      DO 40 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 40
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST AND THEREFORE I NEED NOT
C     BE TESTED FOR EQUALITY WITH LREF.
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      S(LN)=S(LN)+S(LL)
   40 CONTINUE
C     UPDATE THE NEAR AND SREF ARRAYS.  IF THE EXTREME ELEMENT IN ROW I
C     WAS EITHER LREF OR NREF, THEN IT IS NECESSARY TO FIND A NEW EXTREME
C     ELEMENT.  ROWS PRIOR TO NREF NEED NOT BE CONSIDERED.
      DO 50 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 55
   50 CONTINUE
   55 IF(J.EQ.1) GO TO 80
   60 SREF(I)=BIG
      J1=J-1
      DO 70 L=1,J1
      LISTL=LIST(L)
      LL=LFIND(I,LISTL)
      SREFX=S(LL)/(NUMBR(I)*NUMBR(LISTL))
      IF(((SREFX-SREF(I))*SIGN).GE.0.) GO TO 70
      NEAR(I)=LISTL
      SREF(I)=SREFX
   70 CONTINUE
   80 J=J+1
      IF(J.GT.NCL) RETURN
      I=LIST(J)
      IF(NEAR(I).EQ.LREF.OR.NEAR(I).EQ.NREF) GO TO 60
      GO TO 80
      END
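
In the group average routine above, S does not store averages. It stores the sum of all pairwise links between two clusters, and the average is formed only when candidates are compared. A small Python sketch of that bookkeeping follows; the function names are assumptions introduced for illustration.

```python
def merged_link_sum(s_in, s_il):
    """Sum of links between cluster I and the union of NREF and LREF."""
    return s_in + s_il


def group_average(link_sum, n_i, n_j):
    """Average link between two clusters holding n_i and n_j entities."""
    return link_sum / (n_i * n_j)
```

As a worked case, let cluster I hold one entity c, and let NREF = {a} and LREF = {b} with d(a,c) = 2 and d(b,c) = 4. After the merge the stored sum is 6, the merged cluster has 2 entities, and the group average is 6 / (1 * 2) = 3.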

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,NUMBR,SUM,SREFX,SIGN,N,NCL,
     ALREF,NREF,JOB)
C
C     HIERARCHICAL CLUSTERING BY MINIMIZING THE AVERAGE DISTANCE OR
C     MAXIMIZING THE AVERAGE CORRELATION WITHIN THE NEW GROUP.  THAT IS,
C     FOR EACH POTENTIAL MERGE THE AVERAGE OF ALL LINKAGES WITHIN THE
C     NEW GROUP IS CALCULATED.

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),NUMBR(1),SUM(1)
      IF ( JOB - 2 ) 10, 25, 30
C     JOB=1, INITIALIZE.
C     NUMBR(I)=NUMBER OF ENTITIES CURRENTLY IN THE I-TH CLUSTER
C     SUM(I)=SUM OF ALL PAIRWISE SIMILARITIES AMONG ENTITIES IN THE I-TH
C            CLUSTER
   10 WRITE(6,2000)
 2000 FORMAT(37H0AVERAGE LINKAGE WITHIN THE NEW GROUP)
      DO 20 J=1,N
      NUMBR(J)=1
   20 SUM(J)=0.
      BIG=SIGN*1.E50
      RETURN
C     JOB=2, DUMMY ENTRY.
   25 RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
C     UPDATE THE NEW CLUSTER
   30 NUMBR(NREF)=NUMBR(NREF)+NUMBR(LREF)
      LN=LFIND(LREF,NREF)
      SUM(NREF)=SUM(NREF)+SUM(LREF)+S(LN)
C     UPDATE ENTRIES IN THE REDUCED SIMILARITY MATRIX.  THE ENTRIES ARE
C     THE SUM TOTAL OF SIMILARITY VALUES ASSOCIATED WITH ALL
C     PAIRWISE LINKS BETWEEN THE ELEMENTS OF THE TWO CLUSTERS.
      DO 40 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 40
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST AND THEREFORE I NEED NOT
C     BE TESTED FOR EQUALITY WITH LREF.
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      S(LN)=S(LN)+S(LL)
   40 CONTINUE
C     UPDATE THE NEAR AND SREF ARRAYS.  IF THE EXTREME ELEMENT IN ROW I
C     WAS EITHER LREF OR NREF, THEN IT IS NECESSARY TO FIND A NEW EXTREME
C     ELEMENT.  ROWS PRIOR TO NREF NEED NOT BE CONSIDERED.
      DO 50 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 55
   50 CONTINUE
   55 IF(J.EQ.1) GO TO 80
   60 SREF(I)=BIG
      J1=J-1
      DO 70 L=1,J1
      LISTL=LIST(L)
      LL=LFIND(I,LISTL)
      NTOT=NUMBR(I)+NUMBR(LISTL)
      NTOT=(NTOT*(NTOT-1))/2
      SREFX=(SUM(I)+SUM(LISTL)+S(LL))/NTOT
      IF(((SREFX-SREF(I))*SIGN).GE.0.) GO TO 70
      NEAR(I)=LISTL
      SREF(I)=SREFX
   70 CONTINUE
   80 J=J+1
      IF(J.GT.NCL) RETURN
      I=LIST(J)
      IF(NEAR(I).EQ.LREF.OR.NEAR(I).EQ.NREF) GO TO 60
      GO TO 80
      END
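
The criterion SREFX in the routine above is the average over every pair inside the group that a merge would create: the links already inside each cluster plus the links between them, divided by the number of pairs. A minimal Python sketch, with a hypothetical function name and argument names chosen to mirror SUM, S and NUMBR:

```python
def within_group_average(sum_i, sum_j, cross_sum, n_i, n_j):
    """Average of all pairwise links inside the union of two clusters.

    sum_i, sum_j -- sums of links already inside each cluster (SUM)
    cross_sum    -- sum of links between the two clusters (S)
    n_i, n_j     -- cluster sizes (NUMBR); the union has n*(n-1)/2 pairs
    """
    n = n_i + n_j
    n_pairs = n * (n - 1) // 2
    return (sum_i + sum_j + cross_sum) / n_pairs
```

For a cluster {a, b} with d(a,b) = 1 and a singleton {c} with d(a,c) = 2 and d(b,c) = 4, the merged group has 3 pairs and the criterion is (1 + 0 + 6) / 3.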

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,A,B,SREFX,SIGN,N,NCL,LREF,NREF,
     AJOB)
C
C     HIERARCHICAL CLUSTERING BY THE MEDIAN METHOD OF
C         GOWER, J.C., A COMPARISON OF SOME METHODS OF CLUSTER ANALYSIS,
C         BIOMETRICS, VOLUME 23, NUMBER 4, DECEMBER 1967, PP 623-637.
C
C     THE PARTICULAR ALGORITHM USED HERE IS DESCRIBED IN
C         LANCE, G.N. AND W.T. WILLIAMS, A GENERAL THEORY OF CLASSIFICATORY
C         SORTING STRATEGIES, I. HIERARCHICAL SYSTEMS, THE COMPUTER JOURNAL,
C         VOLUME 9, NUMBER 4, FEBRUARY 1967, PP 373-380.
C

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),A(1),B(1)
      IF ( JOB - 2 ) 10, 15, 20
C     JOB=1, INITIALIZATION
   10 WRITE(6,2000)
 2000 FORMAT(44H0MEDIAN METHOD OF GOWER.  BEWARE OF REVERSAL)
      BIG=SIGN*1.E50
      RETURN
C     JOB=2, DUMMY ENTRY.
   15 RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
   20 LBET=LFIND(LREF,NREF)
      DO 30 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 30
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST SO I NEED NOT BE
C     TESTED FOR EQUALITY WITH LREF.
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      S(LN)=(S(LN)+S(LL))/2.-S(LBET)/4.
   30 CONTINUE
C     UPDATE THE NEAR AND SREF ARRAYS.  IF THE EXTREME ELEMENT IN ROW I
C     WAS EITHER LREF OR NREF, THEN IT IS NECESSARY TO FIND A NEW EXTREME
C     ELEMENT.  ROWS PRIOR TO NREF NEED NOT BE CONSIDERED.
   40 DO 50 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 55
   50 CONTINUE
   55 IF(J.EQ.1) GO TO 80
   60 SREF(I)=BIG
      J1=J-1
      DO 70 L=1,J1
      LISTL=LIST(L)
      LL=LFIND(I,LISTL)
      IF(((S(LL)-SREF(I))*SIGN).GE.0.) GO TO 70
      NEAR(I)=LISTL
      SREF(I)=S(LL)
   70 CONTINUE
   80 J=J+1
      IF(J.GT.NCL) RETURN
      I=LIST(J)
      IF(NEAR(I).EQ.LREF.OR.NEAR(I).EQ.NREF) GO TO 60
      GO TO 80
      END
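
The single update statement of the median routine above can be checked against geometry. When S holds squared Euclidean distances, the formula gives the squared distance from cluster I to the midpoint of the two merged clusters. The Python sketch below is illustrative only; the function and argument names are assumptions, and the squared-distance interpretation is the usual reading of Gower's method rather than something stated in the listing.

```python
def median_update(d_il, d_in, d_ln):
    """New S(I,NREF) for the median method.

    d_il -- value between I and LREF
    d_in -- value between I and NREF
    d_ln -- value between LREF and NREF (S(LBET) in the listing)
    """
    return (d_in + d_il) / 2.0 - d_ln / 4.0
```

Take points on a line with LREF at 0, NREF at 2 and I at 5. The squared distances are 25, 9 and 4, and the update gives (9 + 25)/2 - 4/4 = 16, which is the squared distance from 5 to the midpoint 1.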

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,NUMBR,SUM,SREFX,SIGN,N,NCL,
     ALREF,NREF,JOB)
C
C     HIERARCHICAL CLUSTERING BY CENTROID SORTING
C
C     THE PARTICULAR ALGORITHM USED HERE IS DESCRIBED IN
C         LANCE, G.N. AND W.T. WILLIAMS, A GENERAL THEORY OF CLASSIFICATORY
C         SORTING STRATEGIES, I. HIERARCHICAL SYSTEMS, THE COMPUTER JOURNAL,
C         VOLUME 9, NUMBER 4, FEBRUARY 1967, PP 373-380.
C

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),NUMBR(1),SUM(1)
      IF ( JOB - 2 ) 10, 25, 30
C     JOB=1, INITIALIZE.
C     NUMBR(I)=NUMBER OF ENTITIES CURRENTLY IN THE I-TH CLUSTER
   10 WRITE(6,2000)
 2000 FORMAT(42H0CENTROID CLUSTERING.  BEWARE OF REVERSALS)
      DO 20 J=1,N
   20 NUMBR(J)=1
      BIG=SIGN*1.E50
      RETURN
C     JOB=2, DUMMY ENTRY.
   25 RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
C     UPDATE THE NEW CLUSTER
   30 NTOT=NUMBR(NREF)+NUMBR(LREF)
      TOT=NTOT
      ALL=NUMBR(LREF)/TOT
      ALN=NUMBR(NREF)/TOT
      NUMBR(NREF)=NTOT
      PROD=ALN*ALL
      LBET=LFIND(LREF,NREF)
      DO 40 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 40
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST AND THEREFORE I NEED NOT
C     BE TESTED FOR EQUALITY WITH LREF.
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      S(LN)=ALL*S(LL)+ALN*S(LN)-PROD*S(LBET)
   40 CONTINUE
C     UPDATE THE NEAR AND SREF ARRAYS.  IF THE EXTREME ELEMENT IN ROW I
C     WAS EITHER LREF OR NREF, THEN IT IS NECESSARY TO FIND A NEW EXTREME
C     ELEMENT.  ROWS PRIOR TO NREF NEED NOT BE CONSIDERED.
      DO 50 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 55
   50 CONTINUE
   55 IF(J.EQ.1) GO TO 80
   60 SREF(I)=BIG
      J1=J-1
      DO 70 L=1,J1
      LISTL=LIST(L)
      LL=LFIND(I,LISTL)
      IF(((S(LL)-SREF(I))*SIGN).GE.0.) GO TO 70
      NEAR(I)=LISTL
      SREF(I)=S(LL)
   70 CONTINUE
   80 J=J+1
      IF(J.GT.NCL) RETURN
      I=LIST(J)
      IF(NEAR(I).EQ.LREF.OR.NEAR(I).EQ.NREF) GO TO 60
      GO TO 80
      END
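
The centroid routine above weights the two old links by the relative sizes of the merged clusters. The following Python sketch restates that update with hypothetical names; as with the median method, reading S as a squared Euclidean distance is an assumption that makes the result equal to the squared distance to the new centroid.

```python
def centroid_update(d_il, d_in, d_ln, n_l, n_n):
    """New S(I,NREF) for centroid sorting.

    d_il, d_in -- values between I and LREF, and between I and NREF
    d_ln       -- value between LREF and NREF
    n_l, n_n   -- sizes of LREF and NREF before the merge
    """
    tot = float(n_l + n_n)
    a_l = n_l / tot          # ALL in the listing
    a_n = n_n / tot          # ALN in the listing
    return a_l * d_il + a_n * d_in - a_l * a_n * d_ln
```

Take LREF as one point at 0 and NREF as three points with centroid 4, so the merged centroid is 3. For a cluster I at 7 the squared distances are 49, 9 and 16, and the update gives 12.25 + 6.75 - 3 = 16, the squared distance from 7 to 3.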

*EOR

      SUBROUTINE METHOD(S,NEAR,SREF,LIST,NUMBR,SUM,SREFX,SIGN,N,NCL,
     ALREF,NREF,JOB)
C
C     HIERARCHICAL CLUSTERING BY THE METHOD OF
C         WARD, J.H.,JR, HIERARCHICAL GROUPING TO OPTIMISE AN OBJECTIVE
C         FUNCTION, JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, VOLUME
C         58, 1963, PP 236-244.
C
C     THE PARTICULAR ALGORITHM USED HERE IS DESCRIBED IN
C         WISHART, D., AN ALGORITHM FOR HIERARCHICAL CLASSIFICATIONS,
C         BIOMETRICS, VOLUME 25, NUMBER 1, MARCH 1969, PP 165-170.
C

      DIMENSION S(1),NEAR(1),SREF(1),LIST(1),NUMBR(1),SUM(1)
      IF ( JOB - 2 ) 10, 25, 30
C     JOB=1, INITIALIZE.
C     NUMBR(I)=NUMBER OF ENTITIES CURRENTLY IN THE I-TH CLUSTER
   10 WRITE(6,2000)
 2000 FORMAT(44H0HIERARCHICAL GROUPING BY THE METHOD OF WARD)
      DO 20 J=1,N
   20 NUMBR(J)=1
      W=0.
      BIG=SIGN*1.E50
      RETURN
C     JOB=2, CALCULATE OBJECTIVE FUNCTION VALUE
   25 W=W+SREFX/2.
      SREFX=W
      RETURN
C     JOB=3, UPDATE FOR NEXT ROUND.
   30 LBET=LFIND(LREF,NREF)
      NTOT=NUMBR(LREF)+NUMBR(NREF)
      DO 40 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 40
C     RECALL THAT LREF HAS BEEN REMOVED FROM LIST SO I NEED NOT BE
C     TESTED FOR EQUALITY WITH LREF.
      LL=LFIND(I,LREF)
      LN=LFIND(I,NREF)
      S(LN)=(S(LN)*(NUMBR(I)+NUMBR(NREF))+S(LL)*(NUMBR(I)+NUMBR(LREF))-
     AS(LBET)*NUMBR(I))/(NTOT+NUMBR(I))
   40 CONTINUE
      NUMBR(NREF)=NTOT
C     UPDATE THE NEAR AND SREF ARRAYS.  IF THE EXTREME ELEMENT IN ROW I
C     WAS EITHER LREF OR NREF, THEN IT IS NECESSARY TO FIND A NEW EXTREME
C     ELEMENT.  ROWS PRIOR TO NREF NEED NOT BE CONSIDERED.
      DO 50 J=1,NCL
      I=LIST(J)
      IF(I.EQ.NREF) GO TO 55
   50 CONTINUE
   55 IF(J.EQ.1) GO TO 80
   60 SREF(I)=BIG
      J1=J-1
      DO 70 L=1,J1
      LISTL=LIST(L)
      LL=LFIND(I,LISTL)
      IF(((S(LL)-SREF(I))*SIGN).GE.0.) GO TO 70
      NEAR(I)=LISTL
      SREF(I)=S(LL)
   70 CONTINUE
   80 J=J+1
      IF(J.GT.NCL) RETURN
      I=LIST(J)
      IF(NEAR(I).EQ.LREF.OR.NEAR(I).EQ.NREF) GO TO 60
      GO TO 80
      END
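
The two-line update statement in the Ward routine above is the size-weighted recurrence attributed in the comments to Wishart. A hedged Python sketch follows; the names are assumptions, and the check uses squared Euclidean distances between single entities as the starting values, which is the setting in which the JOB=2 step (adding half the merge value to W) accumulates the error sum of squares.

```python
def ward_update(d_il, d_in, d_ln, n_i, n_l, n_n):
    """New S(I,NREF) for Ward's method.

    d_il, d_in -- values between I and LREF, and between I and NREF
    d_ln       -- value between LREF and NREF
    n_i, n_l, n_n -- sizes of I, LREF and NREF before the merge
    """
    numerator = (n_i + n_n) * d_in + (n_i + n_l) * d_il - n_i * d_ln
    return numerator / (n_i + n_l + n_n)
```

With single points at 0 (LREF), 2 (NREF) and 5 (I), the squared distances are 25, 9 and 4 and the update gives (2*9 + 2*25 - 4)/3 = 64/3. Half of that, 32/3, is the increase in within-group sum of squares from joining the point at 5 to the pair {0, 2}.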

*EOR

      PROGRAM CLUSVEC(INPUT = 203B, OUTPUT = 203B, TAPE1, TAPE9,
     1 TAPE5 = INPUT, TAPE6 = OUTPUT)
C
C     THIS PROGRAM PRODUCES A COMPLETE PAIRWISE DISTANCE VECTOR FROM THE
C     STEP BY STEP MERGE INFORMATION PROVIDED BY THE CLUSTER PROGRAMS.
C     THE DISTANCES OUTPUT ARE THE FINALIZED DISTANCES BETWEEN CASES AS
C     DETERMINED BY THE CLUSTERING ALGORITHM.
C     FILE 1 IS THE INPUT FILE CONTAINING THE MERGE INFORMATION AS
C     PROVIDED BY THE CLUSTERING PROGRAM.
C     THE CORRESPONDENCE BETWEEN THE NAMES USED FOR THE INPUT VARI-
C     ABLES IN THIS PROGRAM AND THEIR NAMES IN THE CLUSTER PROGRAM OUT-
C     PUT STATEMENTS IS AS FOLLOWS...
C
C          IDAT(1, K)     K
C          IDAT(2, K)     I(K)
C          IDAT(3, K)     J(K)
C          IDAT(4, K)     S(K)
C          IDAT(5, K)     IL(K)
C          IDAT(6, K)     JL(K)
C
C     NOTE... IDAT(4, K) CONTAINS A REAL QUANTITY.  IT IS, HOWEVER,
C     ONLY READ AND WRITTEN, AND NOT USED IN ANY COMPUTATIONS.
C     FILE 9 IS THE OUTPUT FILE WHERE THE PAIRWISE DISTANCE
C     VECTOR IS WRITTEN.  THIS FILE CONTAINS ONE LINE PER PAIR, EACH
C     LINE CONTAINING THE SEQUENCE NUMBERS OF THE CASES IN THE PAIR,
C     AND THE DISTANCE FOR THAT PAIR, UNDER THE FORMAT (2I5,F12.7).
C     NOTE THAT THIS FILE IS NOT SORTED AS WRITTEN BY PROGRAM CLUSVEC.
C     THE FILE IS SORTED BY A SYSTEM SORT ROUTINE, CALLED SORT/MERGE
C     AT THE UNIVERSITY OF WASHINGTON CDC 6000 INSTALLATION.  THE FILE
C     IS SORTED SO THAT THE ORDER OF THE VECTOR IS IN THE ORDER OF A
C     SEQUENTIALLY STORED LOWER LEFTHAND TRIANGULAR MATRIX, I.E. THE
C     PAIRS ARE IN ASCENDING ORDER AND THE FIRST ELEMENT OF THE PAIR IS
C     OF HIGHER SIGNIFICANCE THAN THE SECOND ELEMENT OF THE PAIR.
C     THIS PROGRAM USES DYNAMIC STORAGE ALLOCATION, AND READS THE
C     NUMBER OF CASES FROM THE INPUT (OR RC°AR) FILE.
C     PROGRAM WAS WRITTEN BY R.L. FLEWELLING, UNIV. OF WASH., AND MODI-
C     FIED BY ROBERT B. LEDINGHAM, UNIV. OF WASH.
C

      COMMON IDAT(8, 1)
      DATA LMEM / 4LMEMP /
      READ (5, 200) NE
  200 FORMAT(4X, I4)
      MEMORY = ISHIFT(LOCF(IDAT(1, 1)) + 9 * NE, 30)
      CALL RAPI(LMEM + LOCF(MEMORY))
      NC=NE-1
      DO 5 I=1,NC
      READ (1, 10) (IDAT(J, I), J = 1, 6)
   10 FORMAT(3I5,E16.3,2I5)
    5 CONTINUE
   20 CX=NC
   21 DO 25 I = 1, NE
      IDAT(7, I) = 0
   25 IDAT(8, I) = 0
      IGRP=1
      NOWI=1
      IDAT(7, IGRP)=IDAT(2, CX)
      IF(IDAT(5, CX).EQ.0) GO TO 50
      NCM=CX-1
   28 DO 30 NLM=1,NCM
      NL=NCM-NLM+1
      IF(IDAT(2, NL).NE.IDAT(7, NOWI)) GO TO 30
      IGRP=IGRP+1
      IDAT(7, IGRP)=IDAT(3, NL)
   30 CONTINUE
      NOWI=NOWI+1
      IF(IDAT(7, NOWI).EQ.0) GO TO 50
      GO TO 28
   50 JGRP=1
      NOWJ=1
      IDAT(8, JGRP)=IDAT(3, CX)
      IF(IDAT(6, CX).EQ.0) GO TO 100
      NCM=CX-1
   55 DO 60 NLM=1,NCM
      NL=NCM-NLM+1
      IF(IDAT(2, NL).NE.IDAT(8, NOWJ)) GO TO 60
      JGRP=JGRP+1
      IDAT(8, JGRP)=IDAT(3, NL)
   60 CONTINUE
      NOWJ=NOWJ+1
      IF(IDAT(8, NOWJ).EQ.0) GO TO 100
      GO TO 55
  100 DO 80 I=1,NE
      IF(IDAT(7, I).EQ.0) GO TO 95
      DO 85 J=1,NE
      IF(IDAT(8, J).EQ.0) GO TO 80
      M1=IDAT(7, I)
      M2=IDAT(8, J)
      MN=MIN0(M1,M2)
      MX=MAX0(M1,M2)
      WRITE(9,91) MX,MN,IDAT(4, CX)
   91 FORMAT(2I5,F12.7)
   85 CONTINUE
   80 CONTINUE
   95 CX = CX - 1
      IF(CX.GT.0) GO TO 21
      STOP
      END
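
The idea behind the program above is that every pair of cases receives the criterion value of the stage at which the two cases first fall into the same cluster. The listing works backward from the last stage and re-traces group membership each time. The Python sketch below reaches the same pairs by a different and simpler route, walking forward through the merges and keeping member lists. It is an illustration under stated assumptions: the function name is hypothetical, and the merged cluster is assumed to keep the identification number I(K), as the merge data description suggests.

```python
def cophenetic_pairs(merges):
    """Expand a merge list into one (higher, lower, value) row per pair.

    merges -- iterable of (i, j, s): cluster j is merged into cluster i
              at criterion value s, in merge order.
    """
    members = {}            # cluster id -> list of case numbers it holds
    rows = []
    for i, j, s in merges:
        left = members.setdefault(i, [i])
        right = members.pop(j, [j])
        # Every cross pair is joined for the first time at this stage.
        rows.extend((max(a, b), min(a, b), s) for a in left for b in right)
        left.extend(right)
    return rows
```

Sorting the returned rows gives the lower triangular ordering described in the comments, with the first element of each pair the more significant key, for example `sorted(cophenetic_pairs([(1, 2, 1.0), (3, 4, 2.0), (1, 3, 5.0)]))`.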

*EOR

      PROGRAM CORSTD(INPUT = 203B, TAPE2 = 1003B, TAPE1 = 1003B, TAPE8,
     1 TAPE11, TAPE12, TAPE13, TAPE14, TAPE15, TAPE16, TAPE17 = 1003B,
     2 TAPE47, OUTPUT = 203B, TAPE5 = INPUT, TAPE6 = OUTPUT)

DIMENSION   M  N  A  M ( 7 ) ,    00JL(8).    OOIN(IO),    OJS(«).    0JSS(8) ,    TX(1),    T(«  ) 
1    .    SCORRC.    3),    W(7),    AKX(l),  OJM(IO) 

      COMMON NS, IS(2, 12), KX(2047), KVM, X(1)
      EQUIVALENCE (IX(1), X(1)), (AKX(1), KX(1))
      DATA OJS, OJSS / 16 * 0.0 /
      DATA LMEAN, LSD, LM000, LMEMP, IBL / 4HMEAN, 7HSTD DEV, 4HM000,
     1 4LMEMP, 1H  /
      INT5P(Z) = MIN0(MAX0(IFIX(SIGN(ABS(Z) * 1.E5 + 0.5, Z)),
     1 -9900000), 9900000)

C     FROM FILES 11..NMETH + 10 GET PAIRWISE JOINING DISTANCES FOR THE
C     NMETH METHODS.  FROM FILE 1 GET ORIGINAL DISTANCE MATRIX.  DO
C     SPEARMAN CORRELATIONS AND WRITE COMPOSITE DISTANCE MATRIX TO
C     FILE 2, WHERE COMPOSITE DISTANCE IS WEIGHTED SUM OF JOINING
C     DISTANCES AFTER STANDARDIZATION, WITH THE WEIGHTS THE SQUARED
C     CORRELATIONS BETWEEN ORIGINAL DISTANCE AND JOINING DISTANCES.
C     IF SPECIFIED, WRITE CORRELATION MATRIX AND STANDARDIZED DISTANCES
C     TO FILE 8.  FILE 47 IS SCRATCH FILE.  NMETH SHOULD BE =< 6, AND
C     CORRELATION MAY BE PERFORMED ON SAMPLE OF < 2048 SETS OF DISTANCE
C     TO INCREASE SAMPLE SIZE BEYOND 2047, SEE INSTRUCTIONS IN S.R. SORT
C     THIS PROGRAM WAS WRITTEN BY ROBERT B. LEDINGHAM, UNIV. OF WASH.
C
C     READ PARAMETERS, GET SPACE FOR DATA FOR CORRELATION.
C
C     FOR SAMPLING PARAMETER SAM... IF SAM .LE. 0.0, SET SAM TO USE ABOUT
C     600 PAIRS.  ADJUST SAM DOWNWARD IF NOT ENOUGH MEMORY AVAILABLE.
C     START OFF BY ALLOCATING ENOUGH MEMORY FOR ABOUT 1000 VALUES.

      CALL SECOND(TO)
      READ (5, 200) NH, NMETH, SAM, ISJL, MNAM
      ISJL = MIN0(IABS(ISJL), 1)
      MM = 2 - ISJL
      IF ( NMETH .LT. 1 ) NMETH = 6
      NMP1 = NMETH + 1
      DO 20 M = 1, NMETH
   20 IF ( MNAM(M) .EQ. IBL ) MNAM(M) = LM000 + ISHIFT(M, 36)
      WRITE(6, 210) NH, SAM, ISJL, NMETH, MNAM, TO
      FNN = NN = (NH - 1) * NH / 2
      FNNM = FNN - 1.0
      IF ( SAM .LE. 0.0 ) SAM = 600.0 / FNN
      SAM = AMIN1(SAM, 1.0)
      FNVAL = FLOAT(NMP1) * FNN

C
C  ...GET JOBCARD MAXIMUM MEMORY ALLOWABLE IN MAVAR.  SUBROUTINE
C  ...RAPI PERFORMS SYSTEM CALL TO PP ROUTINE MEM.
C
      MAVAIL = ISHIFT(-1, 30) .AND. MASK(30)
      CALL RAPI(LMEMP + LOCF(MAVAR))
      MAVAR = ISHIFT(MAVAR, -30) - LOCF(X(1))
      IF ( MAVAR .GE. IFIX(FNVAL * SAM * 1.1) ) GO TO 30
      SAM = FLOAT(MAVAR) / (FNVAL * 1.1)
      WRITE(6, 211) MAVAR
   30 ISAMP = SAM + 0.01
      MREQ = ((MIN0(IFIX(FNVAL * SAM * 0.9), 1063) + LOCF(X(1)))/64)*64
C
C  ...MREQ IS TOTAL REQUIRED FIELD LENGTH (FOR NOW), NXA IS AVAILABLE
C  ...MEMORY FOR X ARRAY (DATA FOR CORRELATIONS).  USE RAPI AGAIN TO
C  ...GET MEMORY.  INITIALIZE RANDOM NUMBER GENERATOR.
C
      NXA = MREQ - LOCF(X(1)) - NMP1
      MEMORY = ISHIFT(MREQ, 30)
      CALL RAPI(LMEMP + LOCF(MEMORY))
      CALL RANSET(T0)
      NXC = NS = 0

      I = 11
C
C  READ DATA, SAVE ON FILE 47.  ALSO SAVE SAMPLE IN MEMORY, GETTING
C  MORE MEMORY IF NEEDED IN 512-WORD CHUNKS.  ACCUMULATE SUM AND SUM
C  OF SQUARES FOR EACH DISTANCE FOR ALL CASES.
C
      DO 75 J = 1, NN
      I = I + 1
      IF ( I .LT. 11 ) GO TO 40
      READ (1, 250) ODIN
      I = 1
   40 ODJL(1) = ODIN(I)
      DO 50 M = 1, NMETH
      K = M + 10
   50 READ (K, 230) ODJL(M+1)
      WRITE(47) (ODJL(M), M = MM, NMP1)
      IF ( ISAMP .NE. 0 ) GO TO 55
      IF ( RANF(0.0) .GE. SAM ) GO TO 70
   55 IF ( NS .GE. 2047 ) GO TO 57
      IF ( NXC .LT. NXA ) GO TO 60
      NXA = NXA + 512
      MREQ = MREQ + 512
      IF ( NXA .LE. MAVAIL ) GO TO 58
      WRITE(6, 213) NS, J, NN, MREQ
   57 SAM = -1.0
      ISAMP = 0
      GO TO 70
   58 MEMORY = ISHIFT(MREQ, 30)
      CALL RAPI(LMEMP + LOCF(MEMORY))
   60 NS = NS + 1
      DO 65 M = 1, NMP1
      NXC = NXC + 1
   65 X(NXC) = ODJL(M)
   70 DO 75 M = 1, NMP1
      OJS(M) = ODJL(M) + OJS(M)
   75 OJSS(M) = ODJL(M) ** 2 + OJSS(M)


C
C  HAVE DATA FOR CORRELATIONS IN X(1..NMP1, 1..NS).  SPEARMAN CORREL-
C  ATION PROCEDURE USES FORMULAS FROM SPSS MANUAL, 2ND EDITION,
C  PP 289-290.
C
C  START CORRELATIONS BY, FOR EACH SET OF DISTANCES, REPLACING VALUES
C  BY ORDINAL RANKS, KEEPING TRACK OF CORRECTION FOR TIES (T).  IN
C  THE CASE OF TIES, ASSIGN AVERAGE RANK TO EACH TIED ELEMENT.
C  RECALL THAT X AND IX ARRAYS ARE EQUIVALENCED, AND KX AND AKX
C  ARRAYS ARE EQUIVALENCED.
C

      NNNS = (NS ** 2 - 1) * NS
      NSM = NS * NMP1
      PRINT 212, SAM, NS, MREQ
      M0 = -NMP1
      DO 100 M = 1, NMP1
      M0 = M0 + 1
      K = M
      ITF = NNNS
C
C  ...MAKE ARRAY KX...KX(I = 1..NS) = IX(M, I = 1..NS) WITH LOW-ORDER
C  ...12 BITS OF KX(I) REPLACED BY VALUE I.  THEN SORT KX ARRAY.
C
      DO 85 I = 1, NS
      KX(I) = IX(K) .AND. MASK(48) .OR. I
   85 K = K + NMP1
      CALL SORT

C
C  ...REPLACE X(M, 1..NS) WITH ORDINAL RANKS.  ITI IS NUMBER OF
C  ...INSTANCES OF VALUE RCO.  LOOKING AT EACH ELEMENT OF KX (SORTED
C  ...ARRAY) IN TURN...
C
      ITI = 0
      RCO = -1.E99
      DO 95 I = 1, NS
      XI = AKX(I) .AND. MASK(48)
      IF ( RCO .LT. XI ) GO TO 90
C
C  ...VALUE SAME AS PREVIOUS.  INCREMENT ITI.
C
      ITI = ITI + 1
      GO TO 95
C
C  ...VALUE HAS CHANGED.  FOR THE ITI PREVIOUS ELEMENTS, PUT IN
C  ...(AVERAGE) RANK XR, AND INCREMENT T ACCUMULATOR IF ITI > 1.
C
   90 RCO = XI
      IF ( ITI - 1 ) 94, 91, 92
   91 K = (KX(I-1) .AND. 7777B) * NMP1 + M0
      X(K) = I - 1
      GO TO 94
   92 XR = FLOAT(2 * I - ITI - 1) * 0.5
      DO 93 J = 1, ITI
      K = (KX(I-J) .AND. 7777B) * NMP1 + M0
   93 X(K) = XR
      ITF = ITF - (ITI ** 2 - 1) * ITI
   94 ITI = 1
   95 CONTINUE

C
C  ...REPLACE FINAL ITI ELEMENTS WITH RANK, AS ABOVE, AND COMPUTE T
C
      XR = FLOAT(2 * NS - ITI + 1) * 0.5
      DO 97 J = 1, ITI
      K = (KX(NS - J + 1) .AND. 7777B) * NMP1 + M0
   97 X(K) = XR
      IF ( ITI .GE. 2 ) ITF = ITF - (ITI ** 2 - 1) * ITI
      T(M) = FLOAT(ITF) / 12.0
  100 CONTINUE

C
C  COMPUTE SPEARMAN CORRELATIONS FOR EACH PAIR...PRINT CORRELATIONS,
C  AND WRITE THEM TO FILE 8 IF ISJL .NE. 0.
C
      DO 120 M1 = 1, NMETH
      M2M = M1 + 1
      SCORR(M1, M1) = 1.0
      DO 120 M2 = M2M, NMP1
      K1 = M1
      RD = 0.0
      DO 110 K2 = M2, NSM, NMP1
      RD = (X(K1) - X(K2)) ** 2 + RD
  110 K1 = K1 + NMP1
  120 SCORR(M1, M2) = SCORR(M2, M1) = (T(M1) + T(M2) - RD) * 0.5
     1 / SQRT(T(M1) * T(M2))
      SCORR(NMP1, NMP1) = 1.0
      WRITE(6, 215) MNAM, (SCORR(K, 1), K = 1, NMP1)
      DO 130 M = 2, NMP1
  130 WRITE(6, 216) MNAM(M-1), (SCORR(K, M), K = 1, NMP1)
      IF ( ISJL .EQ. 0 ) GO TO 140
      DO 135 M = 1, NMP1
  135 WRITE(8, 280) (SCORR(K, M), K = 1, NMP1)

C
C  GET MEAN, STD DEV FOR DISTANCES.  FOR EACH CASE, STANDARDIZE VARI-
C  ABLES, WRITE THEM TO FILE 8 IF ISJL .NE. 0, AND WRITE COMPOSITE
C  DISTANCE SCORE IN FORMAT (10F10.6), I.E. 10 CASES PER LINE.
C  FILE 2 IS OUTPUT FILE FOR COMPOSITE DISTANCE VECTOR.
C
  140 DO 145 M = 1, NMETH
  145 W(M) = SCORR(M+1, 1) ** 2
      WRITE(6, 217) MNAM, (W(M), M = 1, NMETH)
      DO 150 M = 1, NMP1
      OJM(M) = OJS(M) / FNN
  150 OJSS(M) = SQRT((OJSS(M) - OJS(M) * OJM(M)) / FNNM1)
      WRITE(6, 216) LMEAN, (OJM(M), M = 1, NMP1)
      WRITE(6, 218) LSD, (OJSS(M), M = 1, NMP1)
      REWIND 47
      I = 0
      IF ( ISJL .NE. 0 ) GO TO 170

C
C  ...ISJL = 0... DON'T WRITE Z-SCORES.  BINARY FILE 47 HAS DISTANCES
C  ...FOR THE NMETH METHODS ONLY.
C
      DO 165 K = 1, NN
      READ (47) (ODJL(M), M = 1, NMETH)
      WJL = (ODJL(1) - OJM(2)) * W(1) / OJSS(2)
      DO 160 M = 2, NMETH
  160 WJL = (ODJL(M) - OJM(M+1)) * W(M) / OJSS(M+1) + WJL
      I = I + 1
      ODIN(I) = WJL
      IF ( I .LT. 10 ) GO TO 165
      I = 0
      WRITE(2, 250) ODIN
  165 CONTINUE
      GO TO 190

C
C  ...ISJL .NE. 0... WRITE Z-SCORES TO FILE 8.  BINARY FILE 47 HAS
C  ...ORIGINAL DISTANCES AND JOINING DISTANCES FOR THE NMETH METHODS.
C
  170 DO 180 K = 1, NN
      READ (47) (ODJL(M), M = 1, NMP1)
      KX(1) = INT5P((ODJL(1) - OJM(1)) / OJSS(1))
      DJ = (ODJL(2) - OJM(2)) / OJSS(2)
      WJL = DJ * W(1)
      KX(2) = INT5P(DJ)
      DO 175 M = 3, NMP1
      DJ = (ODJL(M) - OJM(M)) / OJSS(M)
      WJL = DJ * W(M-1) + WJL
  175 KX(M) = INT5P(DJ)
      I = I + 1
      ODIN(I) = WJL
      IF ( I .LT. 10 ) GO TO 180
      I = 0
      WRITE(2, 250) ODIN
  180 WRITE(8, 282) K, (KX(M), M = 1, NMP1)

C
C  ...WRITE LAST LINE OF COMPOSITE SCORES TO FILE 2, IF
C  ...NH * (NH - 1) / 2 MOD 10 .NE. 0, AND PRINT ELAPSED TIMES.
C
  190 IF ( I .NE. 0 ) WRITE(2, 250) (ODIN(K), K = 1, I)
      CALL SECOND(T1)
      T10 = T1 - T0
      WRITE (6, 219) T1, T10
      STOP

200  F0«fUTUXiK>flZX»I*,F*.2>*X»I4»<(X,7A4) 

  210 FORMAT(21H1NUMBER OF ENTITIES =I4/35H SAMPLING PARAMETER (INPUT VA
     1LUE) =F5.2/46H STANDARDIZED JOINING DISTANCE OUTPUT OPTION =I4,22H
     2   NUMBER OF METHODS =I4/20H NAMES OF METHODS...7A6/24H ELAPSED TI
     3ME AT START =F10.3,8H SECONDS)

  211 FORMAT(50H0SAMPLING PARAMETER ADJUSTED TO FIT JOB CARD CM = O6)
  212 FORMAT(26H0SAMPLING PARAMETER USED =F5.2,7H GIVINGI6,17H PAIRS REQ
     1UIRING O6,16H WORDS OF MEMORY)
  213 FORMAT(36H-*** WARNING...MEMORY OVERFLOW AFTERI6,15H CASES SAVED..
     1./20H *** CASES SAMPLED =I8,3H OFI8,19H...MEM REQUESTED = O6)

  215 FORMAT(38H-COPHENETIC (SPEARMAN) CORRELATIONS.../1H08X8HO.D.    ,
     1 7A8/5H0O.D.8F8.4)
  216 FORMAT(1X,A4,8F8.4)
  217 FORMAT(1H-11X,8HO.D.    7A8/7H0WEIGHT9X,7F8.4)
  218 FORMAT(1H0A7,8F8.4)

  219 FORMAT(25H-ELAPSED TIME AT FINISH =F11.3,8H SECONDS/26H TIME FOR T
     1HIS PROCEDURE =F10.3,8H SECONDS)
  230 FORMAT(10X,F12.7)
  250 FORMAT(10F10.6)
  280 FORMAT(8F10.7)
  282 FORMAT(I6,10X,8I8)
      END
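
The weighting scheme described in the header comments of the program above can be illustrated outside the CDC environment. The following Python sketch is not part of the original program. It assumes that every pairwise distance fits in memory, whereas the FORTRAN program correlates a random sample of at most 2047 pairs, and the function names average_ranks, spearman, and composite_distances are hypothetical. It assigns average ranks to ties, computes the tie-corrected Spearman coefficient, and forms the composite distance as the sum of standardized joining distances weighted by their squared correlation with the original distances.

```python
import math


def average_ranks(values):
    """Return 1-based ranks (ties share the average rank) and sum(t**3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sum = 0
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        count = end - start + 1
        avg = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        tie_sum += count ** 3 - count
        start = end + 1
    return ranks, tie_sum


def spearman(x, y):
    """Tie-corrected Spearman coefficient: (Tx + Ty - sum d^2) / (2 sqrt(Tx Ty))."""
    n = len(x)
    rx, x_ties = average_ranks(x)
    ry, y_ties = average_ranks(y)
    tx = (n ** 3 - n - x_ties) / 12.0
    ty = (n ** 3 - n - y_ties) / 12.0
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return (tx + ty - d2) / (2.0 * math.sqrt(tx * ty))


def composite_distances(original, joining):
    """original: distances for each pair; joining: one distance list per method."""
    n = len(original)
    # Weight of each method is its squared correlation with the original distances.
    weights = [spearman(original, col) ** 2 for col in joining]
    composite = [0.0] * n
    for w, col in zip(weights, joining):
        total = sum(col)
        mean = total / n
        # Sample standard deviation with divisor n - 1.
        sd = math.sqrt((sum(v * v for v in col) - total * mean) / (n - 1))
        for k in range(n):
            composite[k] += w * (col[k] - mean) / sd
    return weights, composite
```

For example, composite_distances([1, 2, 3, 4], [[1, 2, 3, 4], [4, 3, 2, 1]]) gives both methods a weight of 1, because squaring removes the sign of the correlation. A distance vector whose values are all equal has a zero tie-corrected term and would raise a division error in this sketch.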


*EOR


      SUBROUTINE SORT
      COMMON NS, IS(2, 12), KX(2047), KXM
      REAL KX, KXM
      INTEGER F


C
C  SORT ARRAY KX(1..NS) INTO ASCENDING ORDER USING QUICKSORT, ALGOR-
C  ITHM 7.4 OF REINGOLD, NIEVERGELT, AND DEO, MODIFIED TO DO ORDINARY
C  BUBBLE SORT ON SUBARRAYS OF < 9 ELEMENTS.
C
C  IS(2, 12) IS STACK, WITH POINTER ISP.  MAXIMUM ARRAY SIZE IS
C  2047, WHICH CAN BE INCREASED BY INCREASING DIMENSION OF KX
C  (TO MAX SIZE DESIRED) AND IS (TO LOG2 OF MAX SIZE) IN THIS SUBROU-
C  TINE AND IN MAIN PROGRAM.
C


      IS(1, 1) = IS(2, 1) = 0
      ISP = 1
      KX(NS+1) = 1.E99
      F = 1
      L = NS
C
C  WHILE F < L DO...IF L - F < 8, DO BUBBLESORT, THEN POP THE STACK
C
   10 IF ( L - F .GE. 8 ) GO TO 30
      LM1 = L - 1
      DO 20 I = F, LM1
      DO 20 J = I, LM1
      IF ( KX(J+1) .GE. KX(I) ) GO TO 20
      XT = KX(I)
      KX(I) = KX(J+1)
      KX(J+1) = XT
   20 CONTINUE
      F = IS(1, ISP)
      L = IS(2, ISP)
      ISP = ISP - 1
      IF ( F .LT. L ) GO TO 10
      RETURN

C
C  ELSE PARTITION THE ARRAY
C
   30 I = RANF(0.0) * FLOAT(L - F) + F
      XF = KX(I)
      KX(I) = KX(F)
      KX(F) = XF
      I = F
   35 I = I + 1
      IF ( KX(I) .LT. XF ) GO TO 35
      J = L
      IF ( KX(J) .LE. XF ) GO TO 45
   40 J = J - 1
      IF ( KX(J) .GT. XF ) GO TO 40
   45 IF ( I .GE. J ) GO TO 65
   50 XT = KX(I)
      KX(I) = KX(J)
      KX(J) = XT
   55 I = I + 1
      IF ( KX(I) .LT. XF ) GO TO 55
   60 J = J - 1
      IF ( KX(J) .GT. XF ) GO TO 60
      IF ( I .LT. J ) GO TO 50
   65 KX(F) = KX(J)
      KX(J) = XF

C
C  TAKE APPROPRIATE ACTION, DEPENDING ON WHICH, IF ANY, SUBARRAYS ARE
C  NONTRIVIAL
C
      IF ( F .LT. J - 1 ) GO TO 75
      IF ( J + 1 .LT. L ) GO TO 70
C
C  BOTH ARE TRIVIAL...POP THE STACK
C
      F = IS(1, ISP)
      L = IS(2, ISP)
      ISP = ISP - 1
      IF ( F .LT. L ) GO TO 10
      RETURN
C
C  RIGHT SUBARRAY ONLY IS NON-TRIVIAL...SORT IT
C
   70 F = J + 1
      GO TO 10
C
   75 IF ( J + 1 .LT. L ) GO TO 80
C
C  LEFT SUBARRAY ONLY IS NON-TRIVIAL...SORT IT
C
      L = J - 1
      GO TO 10

C
C  BOTH SUBARRAYS ARE NON-TRIVIAL...PUT LARGER ONE ON STACK, SORT
C  OTHER ONE
C
   80 IF ( L - J .LT. J - F ) GO TO 85
      ISP = ISP + 1
      IS(1, ISP) = J + 1
      IS(2, ISP) = L
      L = J - 1
      GO TO 10
   85 ISP = ISP + 1
      IS(1, ISP) = F
      IS(2, ISP) = J - 1
      F = J + 1
      GO TO 10
      END
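
The control flow of subroutine SORT (an explicit stack of pending subarrays, a random pivot, and a simple exchange sort for short subarrays) may be easier to follow in a structured language. The Python sketch below is an illustration only, not a transcription. It replaces the 1.E99 sentinel stored after the last element with explicit bounds checks, and the names hybrid_quicksort, partition, simple_exchange_sort, and SMALL are hypothetical.

```python
import random

SMALL = 8  # subarrays with last - first < SMALL are sorted directly


def simple_exchange_sort(a, first, last):
    """Sort a[first..last] in place by pairwise exchanges."""
    for i in range(first, last):
        for j in range(i + 1, last + 1):
            if a[j] < a[i]:
                a[i], a[j] = a[j], a[i]


def partition(a, first, last, rng):
    """Partition a[first..last] around a random pivot; return its final index."""
    p = first + int(rng() * (last - first))  # pivot index in [first, last - 1]
    a[first], a[p] = a[p], a[first]
    pivot = a[first]
    i, j = first + 1, last
    while True:
        while i <= last and a[i] < pivot:
            i += 1
        while a[j] > pivot:  # stops at index first at the latest
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
        i += 1
        j -= 1
    a[first], a[j] = a[j], a[first]
    return j


def hybrid_quicksort(a, rng=random.random):
    """Sort list a in place into ascending order and return it."""
    stack = []  # pending (first, last) index pairs
    first, last = 0, len(a) - 1
    while True:
        while first < last:
            if last - first < SMALL:
                simple_exchange_sort(a, first, last)
                break
            j = partition(a, first, last, rng)
            left, right = (first, j - 1), (j + 1, last)
            # Stack the larger part and continue with the smaller one,
            # which keeps the stack depth near log2 of the array size.
            if left[1] - left[0] > right[1] - right[0]:
                stack.append(left)
                first, last = right
            else:
                stack.append(right)
                first, last = left
        if not stack:
            return a
        first, last = stack.pop()
```

Deferring the larger part is what lets the FORTRAN routine manage with a 12-entry stack for up to 2047 elements.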

*EOR
*EOF